Local Consistency of Markov Chain Monte Carlo Methods

Kengo KAMATANI

Abstract 

In this paper, we introduce the notion of efficiency (consistency) and 
examine some asymptotic properties of Markov chain Monte Carlo meth- 
ods. We apply these results to the Gibbs sampler for independent and 
identically distributed observations. More precisely, we show that if both 
the sample size and the running time of the Gibbs sampler tend to infinity, 
and if the initial guess is not far from the true parameter, the empirical 
distribution of the Gibbs sampler tends to the posterior distribution. This is a
local property of the Gibbs sampler, which may be, in some cases, more
essential than the global properties to describe its behavior. The advantages
of using the local properties are the generality of the underlying model
and the existence of a simple equivalent Gibbs sampler. These yield a simple
regularity condition and suggest the reason for non-regular behaviors,
which provides useful insight into the problem of how to construct efficient 
algorithms. 

1 Introduction 

This paper investigates conditions under which a Markov chain Monte Carlo 
(MCMC) method has a good stability property. There is a vast literature
related to the sufficient conditions for ergodicity: see reviews [17] and [14] and 
textbooks such as [13] and [11]. The Markov probability transition of MCMC 
is Harris recurrent under fairly general assumptions. Moreover, it is sometimes 
geometrically ergodic. In practice, Foster-Lyapunov type drift conditions are 
commonly used to establish geometric ergodicity. This drift condition works 
well in studying MCMC stability, but there are some limitations. 

• Technical difficulty in constructing a drift condition. See [3] for detail. 

• The condition describes global properties of MCMC such as the global
convergence rate and the global mixing rate, but not local properties. For some
MCMC methods, local properties seem important for describing MCMC
efficiency/inefficiency (see, e.g., Examples 6 and 7 of [14]).

We take another approach to study the stability of MCMC methods. It is well
recognized that there are two kinds of randomness in Monte Carlo methods
such as the Gibbs sampler. One is observation randomness and the other is
simulation randomness. Usually we only consider the latter randomness for the
analysis of Monte Carlo methods under a fixed observation. However, it is
natural to consider both kinds of randomness for the analysis, and in fact, the analysis
becomes easier if we consider observation randomness. In particular, we can
apply beautiful results of asymptotic statistical theory when we consider a large
sample situation.

We obtain the following results. 

1. Consistency and local consistency of Monte Carlo procedure are studied. 

2. A reasonable set of sufficient conditions for consistency for the Gibbs sam- 
pler is addressed for independent identically distributed observations. We 
only assume (a) identifiability of parameter, (b) existence of uniformly con- 
sistent test, (c) regularity of prior distribution, and (d) quadratic mean 
differentiability of the full model. 

The paper is divided into two parts. The first part is a study of Monte Carlo
procedures in general, such as importance sampling and the Gibbs sampler. We will
describe the Gibbs sampler as a sequence of random Monte Carlo procedures. As
preparation for the study of a sequence of random Monte Carlo procedures, we
study non-random Monte Carlo procedures in Section 2 and sequences of
non-random Monte Carlo procedures in Section 3. Consistency and local
consistency are introduced in Section 3. In Section 4, we consider Monte Carlo
procedures in general, including the Gibbs sampler.

In the second part we consider a more specific situation, namely a large
sample setting. In Section 5 we prepare some technical tools for the analysis
of the Gibbs sampler. In Section 6 we analyze local consistency of a sequence of
standard Gibbs samplers in the large sample setting.

For a treatment of a large sample setting (with a different motivation), a
recent paper [2] studied the Metropolis algorithm for increasing parameter
dimension $d$. They obtained the rate of the running time of the Metropolis
algorithm for burn-in and after burn-in. To deal with the complex algorithm
and to obtain strong results, they assumed strong conditions (C.1, C.2 and
(3.5)). Other papers, [12] and [16], obtained stability properties of the
stochastic EM algorithm. Essentially they studied finite dimensional convergence of
$\theta(0), \ldots, \theta(k)$. However, without tightness arguments, the finite dimensional
properties are insufficient to describe MCMC behaviors. On the other hand, we
show the convergence of the law of the process $(\theta(i); i \in \mathbb{N}_0)$ with a minimal set
of conditions.

It is not our intention to conclude that the Gibbs sampler is always efficient.
The conclusion of Theorem 6.4 is that under a set of fairly general assumptions,
the empirical distribution constructed by the Gibbs sampler converges to the
posterior distribution in a short running time. On the other hand, it illustrates
the reasons for non-ideal behavior of the Gibbs sampler. For example, (a)
we fail to take a good initial guess, (b) the model fails to satisfy a strong identifiability
condition, (c) the Fisher information matrix $I(\theta)$ is almost $0$ or that for the
hidden information $J(\theta)$ is almost $\infty$, or (d) the sample size is too small relative
to the parameter dimension. For example, the natural Gibbs sampler for the probit
regression model corresponds to case (c) and fails to have a good convergence
property, which is studied in [7]. These studies of regular/non-regular properties
of MCMC are an important step toward constructing new efficient Monte Carlo algorithms,
including adaptive MCMC methods.

1.1 Notation 

Let $\mathbb{N} = \{1, 2, \ldots\}$ and $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$. We write the integer part of $x \in \mathbb{R}$ by $[x]$.



1.1.1 Probability measure, Transition kernel 

For a measurable space $(E, \mathcal{E})$, the space of probability measures on $(E, \mathcal{E})$ is
denoted by $\mathcal{P}(E)$.

For two measurable spaces $(E, \mathcal{E})$ and $(F, \mathcal{F})$, a probability transition kernel
$K$ from $E$ to $F$ is a map $K : E \times \mathcal{F} \to [0, 1]$ such that

1. $K(x, \cdot)$ is a probability measure on $(F, \mathcal{F})$ for $x \in E$.

2. $K(\cdot, A)$ is $\mathcal{E}$-measurable for any $A \in \mathcal{F}$.

We may write $K(dy|x)$ instead of $K(x, dy)$. If $K(x, \cdot)$ is a $\sigma$-finite measure instead
of a probability measure, we call $K$ a transition kernel.

1.1.2 Normal distribution 

Write $\phi(x) = \exp(-x^2/2)/\sqrt{2\pi}$ for the probability density function of $N(0, 1)$
and write $\Phi(x) = \int_{-\infty}^{x} \phi(y)dy$. For $\mu \in \mathbb{R}^p$ and a $p \times p$ positive definite matrix $\Sigma$,
the function $\phi(x; \mu, \Sigma) = \exp(-(x-\mu)^T \Sigma^{-1} (x-\mu)/2)/((2\pi)^p\det(\Sigma))^{1/2}$ is the probability
density function of $N(\mu, \Sigma) = N_p(\mu, \Sigma)$, where $\det(\Sigma)$ is the determinant of $\Sigma$ and
$x^T$ is the transpose of a vector $x \in \mathbb{R}^p$.

1.1.3 Centering 

For a probability measure $\mu$ on $\mathbb{R}$, a central value is a point $x \in \mathbb{R}$ satisfying



An element of $\mathbb{R}^p$ is denoted by $x = (x^1, \ldots, x^p)^T$. For a probability measure
$\mu$ on $\mathbb{R}^p$, let $\mu^i(A)$ be $\int_{x \in \mathbb{R}^p} 1_A(x^i)\mu(dx)$ for $A \in \mathcal{B}(\mathbb{R})$. For $\mu$, we call
$x = (x^1, x^2, \ldots, x^p)^T \in \mathbb{R}^p$ a central value if each $x^i$ is a central value of $\mu^i$. There
is no practical reason for the use of the central value for the Markov chain Monte
Carlo procedures in this paper. We use it because of its existence and
continuity. That is, (a) for the posterior distribution $P_n(d\theta|x_n)$, its mean does
not always exist but the central value does and, moreover, it is unique, and (b)
if $\mu_n \to \mu$, then the central value of $\mu_n$ tends to that of $\mu$. See [5].

2 Non-random Monte Carlo Procedure 

Let $(\Theta, d)$ be a complete separable metric space equipped with Borel $\sigma$-algebra
$\Xi$. Let $(S, \mathcal{S})$ be a measurable space. Usually, $(S, \mathcal{S}) = (\Theta, \Xi)$ but it is not always
the case. Write $S^{\mathbb{N}_0}$ for a countable product of $S$. We use the notation $s_\infty =
(s(0), s(1), \ldots) \in S^{\mathbb{N}_0}$. We write $s_m = (s(0), s(1), \ldots, s(m-1)) \in
S^m$ for a subsequence of $s_\infty$.

2.1 Definition of non-random Monte Carlo Procedure 

Now we are going to define a non-random Monte Carlo procedure. It may sound
strange since a Monte Carlo procedure always involves randomness. The term
"non-random" means that the Monte Carlo procedure does not depend on the observation,
which will be denoted by $x$. We consider a Gibbs sampler as an example.





                  observation randomness    simulation randomness
Non-random MC     not used                  used
(Random) MC       used                      used









The Gibbs sampler is a method which generates a Markov chain $\theta_\infty = (\theta(0), \theta(1), \ldots)$
depending on an observation $x$. Thus there are two kinds of randomness, induced
by $x$ and $\theta_\infty$. The former is observation randomness and the
latter is simulation randomness. Any Monte Carlo procedure uses simulation
randomness but it may not use observation randomness. We call it non-random
if it does not use observation randomness.

A non-random Monte Carlo procedure is constructed from a probability measure
$M$ on $S^{\mathbb{N}_0}$ and a sequence $e = (e_m; m = 1, 2, \ldots)$ where $e_m$ is a probability
transition kernel from $S^m$ to $\Theta$. Now we consider a simple example for the case
$\Theta = S$. If we approximate the integral of a measurable function $f$ with respect
to a probability measure $\Pi$ on $\Theta$, that is,

$$\Pi(f) = \int f(x)\Pi(dx),$$

we generate an independent sequence $\theta_m = (\theta(0), \ldots, \theta(m-1))$ from $\Pi$ and set

$$\frac{1}{m}\sum_{i=0}^{m-1} f(\theta(i)). \tag{2.1}$$

This procedure is generated by $M = \Pi^{\otimes \mathbb{N}_0}$ (see Example 2.2) with

$$e_m(\theta_m, \cdot) = \frac{1}{m}\sum_{i=0}^{m-1} \delta_{\theta(i)}$$

where $\delta_\theta$ is a Dirac measure with its mass on $\theta \in \Theta$. Then (2.1) is $\int_\Theta f(\theta)e_m(\theta_m, d\theta)$.
Thus we define a Monte Carlo procedure as follows.

Definition 2.1 (Non-random Monte Carlo procedure). Let $M$ be a probability
measure on $S^{\mathbb{N}_0}$ and let $e_m$ be a probability transition kernel from $S^m$ to $\Theta$ for
$m = 1, 2, \ldots$ and set $e = (e_m; m = 1, 2, \ldots)$. We call $\mathcal{M} = (M, e)$ a non-random
Monte Carlo procedure on $(S, \Theta)$.

If $S = \Theta$, we call $\mathcal{M}$ a non-random Monte Carlo procedure on $\Theta$. Before
introducing examples, we make some remarks. For simplicity, we write
$e_m(s_\infty)$ or $e_m(s_m)$ instead of $e_m(s_m, \cdot)$. When $S = \Theta$, if $e = (e_m; m = 1, 2, \ldots)$
is defined as in (2.1), that is, $e_m(\theta_m) = m^{-1}\sum_{i=0}^{m-1}\delta_{\theta(i)}$, we call $e$ a sequence of empirical distribution. If $(S, \mathcal{S})$ is a
product $(\Theta, \Xi) \times (Y, \mathcal{Y})$ for some measurable space $(Y, \mathcal{Y})$ and if $e_m(s_m)$, for $s_m =
((\theta(0), y(0)), \ldots, (\theta(m-1), y(m-1)))$, is the same average of Dirac measures at $\theta(0), \ldots, \theta(m-1)$,
then we call $e$ a sequence of empirical distribution on $\Theta$.

Many Monte Carlo methods can be represented as a Monte Carlo procedure
as defined above. An important exception is some sequential Monte Carlo methods,
whose analysis seems to require a different framework.

Example 2.2 (Non-random crude Monte Carlo procedure). Take $S = \Theta$. Let
$\Pi$ be a probability measure on $\Theta$ and $e$ be a sequence of empirical distribution.
Let $M = \Pi^{\otimes \mathbb{N}_0}$, that is,

$$M(d\theta_\infty) = \prod_{i=0}^{\infty} \Pi(d\theta(i)).$$

Then a non-random Monte Carlo procedure $\mathcal{M} = (M, e)$ is called a non-random
crude Monte Carlo procedure on $\Theta$.
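As a minimal sketch of the crude Monte Carlo procedure, the following Python code draws an independent sequence and integrates a function against the resulting empirical distribution, which is the average in (2.1). The target $N(0, 1)$, the test function $f(x) = x^2$, and the names `empirical_integral` and `crude_monte_carlo` are illustrative assumptions, not part of the paper.

```python
import random

def empirical_integral(f, samples):
    """Integrate f against the empirical distribution (1/m) * sum of Dirac masses."""
    return sum(f(theta) for theta in samples) / len(samples)

def crude_monte_carlo(f, draw, m, seed=0):
    """Draw theta(0), ..., theta(m-1) independently and return the average in (2.1)."""
    rng = random.Random(seed)
    samples = [draw(rng) for _ in range(m)]
    return empirical_integral(f, samples)

# Illustrative choice (an assumption): Pi = N(0, 1) and f(x) = x^2, so Pi(f) = 1.
estimate = crude_monte_carlo(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), m=20000)
```

With this choice the estimate should be close to 1 for large `m`, in line with the law of large numbers used later for consistency.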

Example 2.3 (Non-random importance sampling procedure). Take $S = \Theta$.
Let $\Pi, Q$ be probability measures on $\Theta$ and let $\Pi$ be absolutely continuous with
respect to $Q$, that is, $\Pi(A) = 0$ if $Q(A) = 0$ for $A \in \Xi$. Let $e = (e_m; m =
1, 2, \ldots)$ be

$$e_m(\theta_m) = \frac{1}{m}\sum_{i=0}^{m-1} \frac{d\Pi}{dQ}(\theta(i))\delta_{\theta(i)}$$

and $M = Q^{\otimes \mathbb{N}_0}$. Then a non-random Monte Carlo procedure $\mathcal{M} = (M, e)$ is
called a non-random importance sampling procedure on $\Theta$.
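The importance sampling procedure can be sketched in the same way. The code below assumes, purely for illustration, the target $N(0, 1)$ and the proposal $N(1, 1)$, for which the density ratio is $\exp(-x + 1/2)$; the function names are hypothetical. It uses the unnormalized weighting by the density ratio divided by $m$.

```python
import math
import random

def density_ratio(x):
    """dPi/dQ for the assumed pair Pi = N(0, 1), Q = N(1, 1): exp(-x + 1/2)."""
    return math.exp(-x + 0.5)

def importance_sampling(f, m, seed=0):
    """Return the weighted average of f and the total mass of the weighted measure."""
    rng = random.Random(seed)
    draws = [rng.gauss(1.0, 1.0) for _ in range(m)]      # theta(i) ~ Q, independently
    weights = [density_ratio(x) for x in draws]          # dPi/dQ evaluated at theta(i)
    estimate = sum(w * f(x) for w, x in zip(weights, draws)) / m
    total_mass = sum(weights) / m                        # close to 1 for large m
    return estimate, total_mass

# Pi(f) = 0 for f(x) = x under the assumed target N(0, 1).
estimate, total_mass = importance_sampling(lambda x: x, m=20000)
```

The total mass of the weighted measure is only approximately one for finite `m`; a self-normalized variant would divide by the sum of the weights instead of `m`.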

A non-random accept-reject method has a similar form. All of these Monte
Carlo procedures are based on countable products of probability measures $M =
\Pi^{\otimes \mathbb{N}_0}$. The properties of these Monte Carlo procedures will be discussed later.

2.2 Non-random Markov chain Monte Carlo procedure

We are going to define a non-random Markov chain Monte Carlo procedure as a
class of non-random Monte Carlo procedures. Let $\mu$ be a probability measure on
$S$ and $K$ be a probability transition kernel on $S$. Then

$$M(ds_\infty) = \mu(ds(0)) \prod_{i=0}^{\infty} K(s(i), ds(i+1))$$

is said to be the Markov measure on $S^{\mathbb{N}_0}$ generated by $(\mu, K)$.

Definition 2.4 (Non-random Markov chain Monte Carlo procedure). Let $M$
be a Markov measure on $S^{\mathbb{N}_0}$ and let $e_m$ be a probability transition kernel from
$S^m$ to $\Theta$ for $m = 1, 2, \ldots$ and set $e = (e_m; m = 1, 2, \ldots)$. We call $\mathcal{M} = (M, e)$
a non-random Markov chain Monte Carlo procedure on $(S, \Theta)$.

First we state some possibilities for e. 

Example 2.5 (Burn-in, thinning). Take $S = \Theta$. Let $\mathcal{M} = (M, e)$ be a non-random
Markov chain Monte Carlo procedure. If $e = (e_m; m = 1, 2, \ldots)$ is
defined by

$$e_m(s_m) = \frac{1}{m - [m/2]}\sum_{i=[m/2]}^{m-1} \delta_{\theta(i)},$$

$e$ is called a sequence of empirical distribution with burn-in, and if

$$e_m(s_m) = \frac{1}{[m/2]}\sum_{i=0}^{[m/2]-1} \delta_{\theta(2i)},$$

$e$ is called a sequence of empirical distribution with thinning.
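The two choices of empirical distribution differ only in which indices of the chain are averaged. The following sketch makes this explicit for a chain of length $m \ge 2$. Burn-in discards the first $[m/2]$ draws; thinning is taken here to mean keeping every second draw, which is an assumption about the intended convention. All function names are hypothetical.

```python
def burn_in_indices(m):
    """Indices [m/2], ..., m-1 kept after discarding the first half of the chain."""
    return list(range(m // 2, m))

def thinning_indices(m):
    """Indices 0, 2, 4, ... (assumed convention: keep every second draw)."""
    return [2 * i for i in range(m // 2)]

def empirical_mean(f, chain, indices):
    """Integrate f against the uniform distribution on the selected draws."""
    return sum(f(chain[i]) for i in indices) / len(indices)

chain = [10.0, 20.0, 30.0, 40.0, 50.0]
after_burn_in = empirical_mean(lambda x: x, chain, burn_in_indices(len(chain)))
after_thinning = empirical_mean(lambda x: x, chain, thinning_indices(len(chain)))
```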

Now we are going to define a Gibbs sampler as an example of a non-random
Monte Carlo procedure.

Example 2.6 (Non-random Gibbs sampler). Let $(Y, \mathcal{Y})$ be a measurable space
and set $(S, \mathcal{S}) = (Y, \mathcal{Y}) \otimes (\Theta, \Xi)$. Let $P(dy|\theta)$ and $P(d\theta|y)$ be probability
transition kernels from $\Theta$ to $Y$ and from $Y$ to $\Theta$, respectively. Let $\mathcal{M} = (M, e)$
be a Markov chain Monte Carlo procedure on $(S, \Theta)$ having $K$ as the probability
transition kernel for $M$, defined by

$$K((y, \theta), d(y^*, \theta^*)) = P(dy^*|\theta)P(d\theta^*|y^*).$$

Then $\mathcal{M}$ is called a non-random Gibbs sampler on $(S, \Theta)$. Note that $K((y, \theta), \cdot)$
does not depend on $y$.
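The two-step structure of the Gibbs kernel (first draw $y^*$ given $\theta$, then $\theta^*$ given $y^*$) can be sketched on a toy model. The bivariate normal pair with correlation `rho` used below is a hypothetical illustration, not a model from the paper; for it both conditionals are normal and the stationary marginal of $\theta$ is $N(0, 1)$.

```python
import math
import random

def gibbs_chain(rho, m, theta0=0.0, seed=0):
    """Run m Gibbs transitions; each state is a pair (y, theta)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    theta = theta0
    chain = []
    for _ in range(m):
        y = rng.gauss(rho * theta, sd)   # y* ~ P(dy | theta); the old y is not used
        theta = rng.gauss(rho * y, sd)   # theta* ~ P(dtheta | y*)
        chain.append((y, theta))
    return chain

chain = gibbs_chain(rho=0.5, m=20000)
thetas = [theta for _, theta in chain]
mean_theta = sum(thetas) / len(thetas)
var_theta = sum((t - mean_theta) ** 2 for t in thetas) / len(thetas)
```

The update of `y` ignores the previous `y`, which mirrors the remark that the kernel does not depend on $y$.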

For the analysis of the Gibbs sampler, usually, it is sufficient to study

$$\overline{K}(\theta, d\theta^*) = \int_{y \in Y} P(dy|\theta)P(d\theta^*|y). \tag{2.2}$$

See Definition 2.12 for detail. An important exception, which requires the
analysis of $K$ instead of $\overline{K}$, is the Rao-Blackwellization strategy. Rao-Blackwellization
is an effective strategy for Markov chain Monte Carlo procedures. We consider
this strategy as one example of $e$.

Example 2.7 (Rao-Blackwellization). Let $\mathcal{M}$ be a non-random Gibbs sampler
on $(S, \Theta)$ defined as above. Take $e_m$ as

$$e_m(s_m, A) = \frac{1}{m}\sum_{i=0}^{m-1} \int_A P(d\theta|y(i))$$

for $A \in \Xi$, where $s_m = (s(0), \ldots, s(m-1))$ and $s(i) = (y(i), \theta(i))$. Then, for this
choice of $e = (e_m; m = 1, \ldots)$, $e$ is called an empirical distribution with
Rao-Blackwellization.
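A Rao-Blackwellized estimate replaces the indicator of the drawn $\theta(i)$ by the conditional probability given $y(i)$. The sketch below does this for the set $A = (-\infty, a]$ in a hypothetical bivariate normal toy model with correlation `rho` (an assumption for illustration), where the conditional law of $\theta$ given $y$ is $N(\rho y, 1 - \rho^2)$ and the stationary marginal of $\theta$ is $N(0, 1)$.

```python
import math
import random

def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rao_blackwell_probability(a, rho, m, seed=0):
    """Average of P(theta <= a | y(i)) along a Gibbs chain of length m."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    theta, total = 0.0, 0.0
    for _ in range(m):
        y = rng.gauss(rho * theta, sd)             # y(i) given the current theta
        theta = rng.gauss(rho * y, sd)             # theta(i) given y(i)
        total += normal_cdf((a - rho * y) / sd)    # integral over A of P(dtheta | y(i))
    return total / m

# Under the assumed stationary marginal N(0, 1), the mass of (-inf, 0] is 1/2.
probability = rao_blackwell_probability(a=0.0, rho=0.5, m=20000)
```

Only the $y$ component enters the average, which is why this choice of $e$ cannot be analyzed through the marginal kernel of $\theta$ alone.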

Next we define the Metropolis-Hastings algorithm in the following example.
The transition kernel $K$ defined in the following example may be different from
the usual one, denoted by $\overline{K}$ in (2.4). We will explain the relation between the two transition
kernels after the following example.

Example 2.8 (Non-random Metropolis-Hastings procedure). Let $S = \Theta \times
[0, 1] \times \Theta$. Let $\Pi$ be a probability measure on $\Theta$ and $Q$ be a probability transition
kernel on $\Theta$. Let $r$ be a $\Xi^2$-measurable function such that

$$r(x, y)\Pi(dx)Q(x, dy) = \Pi(dy)Q(y, dx).$$

Let $\alpha(x, y) = \min\{1, r(x, y)\}$ be a measurable function called the acceptance ratio.
We define a probability transition kernel $K((x, u, y), d(x^*, u^*, y^*))$ from $S$ to
itself by

$$Q(y, dx^*)1_{[0,1]}(u^*)du^*\left(1(u^* \le \alpha(y, x^*))\delta_{x^*}(dy^*) + 1(u^* > \alpha(y, x^*))\delta_y(dy^*)\right). \tag{2.3}$$

When $\mathcal{M} = (M, e)$ is a non-random Markov chain Monte Carlo procedure and
$M$ has the probability transition kernel $K$, we call $\mathcal{M}$ a non-random Metropolis-Hastings
procedure on $(S, \Theta)$ generated by $(\Pi, Q)$. Note that $K((x, u, y), \cdot)$ does
not depend on $x, u$.

The above representation (2.3) of the transition kernel shows all realizations:
(a) the proposal $x$, (b) $u \sim U[0, 1]$, and (c) $y$, the result of the accept-reject procedure. When we
are only interested in $y$, we can use the simpler, usual notation. For
$A(x) = \int_{y \in \Theta} \alpha(x, y)Q(x, dy)$, the following transition kernel is simpler:

$$\overline{K}(x, dy) = \alpha(x, y)Q(x, dy) + (1 - A(x))\delta_x(dy). \tag{2.4}$$

As with the Gibbs sampler, usually it is sufficient to consider $\overline{K}$ (see Definition 2.12),
but this is not always the case. Some algorithms, such as [1], use the information of
the proposed variable $x$.
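The three recorded components (proposal, uniform variable, accepted state) can be sketched as follows. The target $N(0, 1)$ and the symmetric random-walk proposal with standard deviation `step` are illustrative assumptions, as is the function name; with a symmetric proposal the acceptance ratio reduces to the ratio of target densities.

```python
import math
import random

def metropolis_hastings(m, step=1.0, y0=0.0, seed=0):
    """Return the list of triples (x, u, y) produced by m transitions."""
    rng = random.Random(seed)
    y = y0
    triples = []
    for _ in range(m):
        x = rng.gauss(y, step)                                # (a) proposal x* ~ Q(y, dx*)
        u = rng.random()                                      # (b) u* ~ U[0, 1]
        alpha = min(1.0, math.exp((y * y - x * x) / 2.0))     # acceptance ratio for N(0, 1)
        if u <= alpha:
            y = x                                             # (c) accept: y* = x*
        triples.append((x, u, y))                             # otherwise y* = y
    return triples

triples = metropolis_hastings(m=20000)
mean_y = sum(y for _, _, y in triples) / len(triples)
```

Keeping only the third component of each triple gives the usual Metropolis-Hastings chain; keeping the full triple retains the rejected proposals as well.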

We define ergodicity and stationarity for a non-random Monte Carlo procedure.
We call $\mathcal{M} = (M, e)$ ergodic or stationary if $M$ is ergodic or stationary.
Recall some terminology related to ergodicity and stationarity (see monographs
such as [4] or [15]). Let $T(s_\infty) = (s(1), s(2), \ldots)$ for $s_\infty = (s(0), s(1), \ldots)$.

• A probability measure $M$ is said to be (strictly) stationary if $M(A) =
M(T^{-1}A)$. When $M$ is stationary, the probability measure $\Pi$ defined by
$\Pi(A) := M(\{s_\infty; s(0) \in A\})$ $(A \in \mathcal{S})$ is called the invariant probability
measure.

• A set $A \in \mathcal{S}^{\mathbb{N}_0}$ is called invariant if $T^{-1}A = A$. Let $\mathcal{A}$ be the $\sigma$-algebra
generated by the invariant sets.

• $M$ is called ergodic if $M(A) = 0$ or $1$ for any $A \in \mathcal{A}$.

If M is stationary and ergodic, we have the ergodic theorem (see Theorem 
10.2.1 of [4]). If M is a Markov measure generated by irreducible and positive 
Harris recurrent probability transition kernel, M is ergodic. 

Definition 2.9. Let $\mathcal{M} = (M, e)$ be a non-random Monte Carlo procedure.
When $M$ is ergodic or stationary, we call $\mathcal{M}$ ergodic or stationary,
respectively.

Stationarity and ergodicity play an important role for convergence of Monte 
Carlo procedure. 

2.3 Consistency of non-random Monte Carlo procedure 

Let $BL_1$ be the class of $\Xi$-measurable $\mathbb{R}$-valued functions $f$ satisfying

$$|f(s) - f(t)| \le d(s, t) \quad (s, t \in \Theta).$$

When $\mu, \nu$ are probability measures on $\Theta$, let $w(\mu, \nu)$ denote the bounded Lipschitz
metric, that is,

$$w(\mu, \nu) = \sup_{\psi \in BL_1} \left|\int_{x \in \Theta} \psi(x)\mu(dx) - \int_{x \in \Theta} \psi(x)\nu(dx)\right|.$$

This metric is equivalent to the Prohorov metric, that is, $w(\mu_n, \nu) \to 0$ is equivalent
to the weak convergence of $\mu_n$ to $\nu$. We may write $BL_1(\Theta)$ and $w_\Theta$ instead of $BL_1$
and $w$ to indicate the underlying space.

For a Monte Carlo procedure $\mathcal{M} = (M, e)$ and a probability measure $\Pi$ on $\Theta$,
we define a risk function

$$R_m(\mathcal{M}, \Pi) = \int w(e_m(s_\infty), \Pi)M(ds_\infty).$$

Definition 2.10 (Consistency). Let $\mathcal{M} = (M, e)$ be a non-random Monte Carlo
procedure on $(S, \Theta)$ and $\Pi$ be a probability measure on $\Theta$. Then $\mathcal{M}$ is said to be
consistent to $\Pi$ if $\lim_{m \to \infty} R_m(\mathcal{M}, \Pi) = 0$.

Proposition 2.11. The non-random crude Monte Carlo procedure and the non-random
importance sampling procedure are consistent to $\Pi$. Moreover, a non-random Monte
Carlo procedure is consistent to $\Pi$ if $M$ is stationary and ergodic with invariant
probability measure $\Pi$ and $e$ is a sequence of empirical distribution.



Proof. In each case, the weak law of large numbers holds, that is, for any $\Pi$-integrable
function $f$,

$$\int f(\theta)e_m(s_m)(d\theta) - \int f(\theta)\Pi(d\theta) \to 0$$

in $M$-probability. For $\epsilon > 0$, by separability and completeness of $\Theta$, there is a
relatively compact open set $K$ of $\Theta$ satisfying $\Pi(K^c) < \epsilon/4$. On the relatively
compact set $K$, we can choose a finite sequence $\psi_1, \ldots, \psi_k \in BL_1$ such that for
any $\psi \in BL_1$, there exists $i \in \{1, \ldots, k\}$ such that

$$\sup_{s \in K}|\psi(s) - \psi_i(s)| < \epsilon/2.$$

Let $\xi_n : \Theta \to [0, 1]$ be a bounded Lipschitz function such that $\xi_n \downarrow 1_{K^c}$. By
taking $\psi = \psi\xi_n + (\psi - \psi_i)(1 - \xi_n) + \psi_i(1 - \xi_n)$ and writing $\psi_i(1 - \xi_n) =: \psi_{i,n}$,
$w(e_m(s_m), \Pi)$ is bounded above by

$$e_m(s_m, \xi_n) + \Pi(\xi_n) + \frac{\epsilon}{2} + \sum_{i=1}^{k}\left|\int \psi_{i,n}(\theta)e_m(s_m)(d\theta) - \int \psi_{i,n}(\theta)\Pi(d\theta)\right|,$$

where $e_m(s_m, \xi_n)$ and $\Pi(\xi_n)$ are the integrals of $\xi_n$ with respect to $e_m(s_m, \cdot)$ and $\Pi$.
The $M$-integral of the last term tends to $0$, and that of the first term tends to $\Pi(\xi_n)$
as $m \to \infty$ since $\int_{s_\infty} M(ds_\infty)e_m(s_m, \cdot)$ converges weakly to $\Pi$. Then taking
$n \to \infty$, $\limsup_{m \to \infty} R_m(\mathcal{M}, \Pi)$ is bounded above by

$$\Pi(K^c) + \Pi(K^c) + \frac{\epsilon}{2} + 0 < \epsilon.$$

Hence $\limsup_{m \to \infty} R_m(\mathcal{M}, \Pi) = 0$ as required. □

We define the equivalence of Monte Carlo procedures already mentioned in
Examples 2.6 and 2.8.

Definition 2.12 (Equivalence). Let $(\Theta, d)$ be a metric space with Borel $\sigma$-algebra
$\Xi$, and let $(S^i, \mathcal{S}^i)$ be measurable spaces for $i = 1, 2$. Let $\mathcal{M}^i = (M^i, e^i)$ be
a Monte Carlo procedure on $(S^i, \Theta)$ for $i = 1, 2$. Then $\mathcal{M}^1$ and $\mathcal{M}^2$ are called
equivalent if

$$R_m(\mathcal{M}^1, \Pi) = R_m(\mathcal{M}^2, \Pi)$$

for any $m \in \mathbb{N}$ and probability measure $\Pi$ on $\Theta$.

Example 2.13. Assume the same condition as Example 2.6. Let $\overline{K}$ be as
in (2.2) and $\overline{\mu}(d\theta) = \int_{y \in Y} \mu(dy\,d\theta)$ where $\mu$ is the initial probability measure
of $M$, and set $\overline{M}$ as the Markov measure defined by $\overline{\mu}$ and $\overline{K}$. If $e_m$ does not
depend on $Y^m$ for $m = 1, 2, \ldots$, then $\mathcal{M} = (M, e)$ is equivalent to $\overline{\mathcal{M}} = (\overline{M}, \overline{e})$ where
$\overline{e} = (\overline{e}_i; i = 1, 2, \ldots)$ and $\overline{e}_m(\theta_m) = e_m(s_m)$.

3 Sequence of non-random Monte Carlo procedure

In this section, we consider a sequence of non-random Monte Carlo procedures.
Therefore we will consider a sequence of measurable spaces $(S_n, \mathcal{S}_n)$ and a
sequence of complete and separable metric spaces $(\Theta_n, d_n)$. Write $w_n$ for the
bounded Lipschitz metric for the space of probability measures on $\Theta_n$,
corresponding to the metric $d_n$.

Throughout this section, $\mathcal{M}_n = (M_n, e_n)$ is a non-random Monte Carlo
procedure on $(S_n, \Theta_n)$ where $M_n$ is a probability measure on $S_n^{\mathbb{N}_0}$ and $e_n =
(e_{n,m}; m = 1, 2, \ldots)$ is a sequence of probability transition kernels $e_{n,m}$ from $S_n^m$
to $\Theta_n$ for $m = 1, 2, \ldots$. We write $s_{n,m} = (s_n(0), s_n(1), \ldots, s_n(m-1)) \in S_n^m$
and $s_{n,\infty} = (s_n(0), s_n(1), \ldots) \in S_n^{\mathbb{N}_0}$. As in the previous section, we may write
$e_{n,m}(s_{n,\infty})$ instead of $e_{n,m}(s_{n,m}, \cdot)$.

3.1 Consistency of sequence of non-random Monte Carlo procedure

Let $\Pi_n$ be a probability measure on $\Theta_n$ for each $n = 1, 2, \ldots$. We define a risk
function for each $\mathcal{M}_n = (M_n, e_n)$, $e_n = (e_{n,m}; m = 1, 2, \ldots)$, by

$$R_m(\mathcal{M}_n, \Pi_n) = \int w_n(e_{n,m}(s_{n,\infty}), \Pi_n)M_n(ds_{n,\infty}).$$

Definition 3.1 (Consistency for sequence). Assume that, for $n = 1, 2, \ldots$, $\mathcal{M}_n = (M_n, e_n)$
is a non-random Monte Carlo procedure on $(S_n, \Theta_n)$ and $\Pi_n$ is a probability
measure on $\Theta_n$. Then $(\mathcal{M}_n; n = 1, 2, \ldots)$ is said to be consistent to $(\Pi_n; n =
1, 2, \ldots)$ if $\lim_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) = 0$ for any $m_n \to \infty$.

We show some non-consistent examples (see also [7] for another type of
non-consistency, degeneracy). The first example is importance sampling in high
dimension.

Example 3.2. Let $\Theta_n = S_n = \mathbb{R}^n$ and let $I_n$ be the $n$-dimensional identity
matrix. Let $N_n(\mu_n, \Sigma_n)$ be a normal distribution with mean $\mu_n \in \mathbb{R}^n$ and
positive definite covariance matrix $\Sigma_n$. Consider two probability measures $\Pi_n = N_n(0, I_n)$
and $Q_n = N_n((1, 0, \ldots, 0)^T, I_n)$ and let $\mathcal{M}_n = (M_n, e_n)$ be a non-random crude
Monte Carlo procedure. We show that $(\mathcal{M}_n; n = 1, 2, \ldots)$ is not consistent to
$(\Pi_n; n = 1, 2, \ldots)$. Denote $\theta = (\theta^1, \ldots, \theta^n)^T \in \Theta_n$. The projection of $Q_n$ to
the $i$-th coordinate $\theta^i$ is $N(1, 1)$ for $i = 1$ and $N(0, 1)$ for $i = 2, \ldots, n$. Let

$$N^i_{n,m} = \{(\theta_n(0), \theta_n(1), \ldots); \theta^i_n(j) < 0\ (j = 0, \ldots, m-1)\}$$

where $\theta_n(j) = (\theta^1_n(j), \ldots, \theta^n_n(j))^T \in \Theta_n$. The event $N^i_{n,m}$ has probability $2^{-m}$
under $M_n$ for $i = 2, \ldots, n$. Therefore, by independence of the events $(N^i_{n,m}; i =
2, \ldots, n)$, $N_{n,m} := \bigcup_{i=2}^{n} N^i_{n,m}$ has probability $1 - (1 - 2^{-m})^{n-1}$. For $i = 2, \ldots, n$,
take $\psi \in BL_1$ to be

$$\psi(x) = \max\{0, \min\{1, x^i\}\} \quad (x = (x^1, \ldots, x^n)^T \in \Theta_n).$$

Then if $(\theta_n(0), \theta_n(1), \ldots) \in N^i_{n,m}$, we have

$$\left|\int_{x \in \Theta_n} \psi(x)e_{n,m}(\theta_{n,m})(dx) - \int_{x \in \Theta_n} \psi(x)\Pi_n(dx)\right| = \int_0^\infty \min\{1, x\}\phi(x)dx =: c > 0.$$

Therefore $w_n(e_{n,m}(\theta_{n,m}), \Pi_n) \ge c$ on $N_{n,m}$ and hence

$$R_m(\mathcal{M}_n, \Pi_n) \ge (1 - (1 - 2^{-m})^{n-1})c.$$

We can choose $m_n \to \infty$ so that $\liminf_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) > 0$. Hence it is not
consistent.
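The lower bound in this example is easy to evaluate numerically. The sketch below computes the constant $c$ in closed form and evaluates the bound along the particular choice $m_n = [\log_2 n]$, which is an illustrative assumption (any slowly growing $m_n$ would do); the function names are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def constant_c():
    """Integral over (0, inf) of min{1, x} phi(x) dx = (phi(0) - phi(1)) + (1 - Phi(1))."""
    return normal_pdf(0.0) - normal_pdf(1.0) + 1.0 - normal_cdf(1.0)

def risk_lower_bound(n, m):
    """Lower bound (1 - (1 - 2^{-m})^{n-1}) c for the risk with dimension n and length m."""
    return (1.0 - (1.0 - 2.0 ** (-m)) ** (n - 1)) * constant_c()

# Assumed choice m_n = floor(log2 n): m_n grows, yet the bound stays away from zero.
bounds = [risk_lower_bound(n, int(math.log2(n))) for n in (2 ** 4, 2 ** 8, 2 ** 12, 2 ** 16)]
```

Along this choice the factor $(1 - 1/n)^{n-1}$ tends to $e^{-1}$, so the bound approaches $(1 - e^{-1})c$ rather than zero.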

(Geometric or uniform) ergodicity may not provide enough information on
whether a given Markov chain Monte Carlo method works well or not. In that
approach, we need a good estimate of the convergence rate of the total
variation distance of the marginal distribution, or of the asymptotic variance of the
empirical mean. The analysis of consistency may provide another viewpoint.
Sometimes it provides good information on the behavior of Markov chain
Monte Carlo methods.

Example 3.3. Let $\Theta_n = [-n, n]$ and $S_n = \Theta_n \times [0, 1] \times \Theta_n$. For $n = 1, 2, \ldots$ let
$\Pi_n$ be the restriction of $N(0, 1)$ to the interval $[-n, n]$. Let $Q_n(x, dy) = Q_n(dy) =
N(0, n^{-1})$. Consider the non-random Metropolis-Hastings procedure $\mathcal{M}_n = (M_n, e_n)$
on $\Theta_n$ generated by $(\Pi_n, Q_n)$ where $e_n$ is a sequence of empirical distribution on
$\Theta_n$ and $M_n$ is a Markov measure with initial distribution $\delta_0(dx)1_{[0,1]}(u)du\,Q_n(dy)$.

Intuitively, this non-random Metropolis-Hastings procedure works poorly, and
this is indeed the case. It is easy to see by checking consistency and degeneracy.

Consider the equivalent Markov chain Monte Carlo procedure $\overline{\mathcal{M}}_n = (\overline{M}_n, \overline{e})$ as
in the comment after Example 2.8. For any fixed $m \in \mathbb{N}$,

$$\overline{M}_n(\{\theta_\infty; \max_{i=0,\ldots,m-1}|\theta(i)| > 1\}) \le Q_n^{\otimes \mathbb{N}_0}(\{\theta_\infty; \max_{i=0,\ldots,m-1}|\theta(i)| > 1\}) \to 0$$

for $\theta_\infty = (\theta(0), \theta(1), \ldots)$. Take $\psi \in BL_1$ to be

$$\psi(x) = \max\{0, \min\{1, |x| - 1\}\} \quad (x \in \Theta_n).$$

Then if $\max_{i=0,\ldots,m-1}|\theta(i)| \le 1$,

$$\left|\int_{x \in \Theta_n} \psi(x)\overline{e}_m(\theta_m)(dx) - \int_{x \in \Theta_n} \psi(x)\Pi_n(dx)\right| = \int_{x \in \Theta_n, |x| > 1} \frac{\min\{1, |x| - 1\}\phi(x)}{1 - 2\Phi(-n)}dx > c > 0.$$

Therefore

$$R_m(\mathcal{M}_n, \Pi_n) \ge \overline{M}_n(\{\theta_\infty; \max_{i=0,\ldots,m-1}|\theta(i)| \le 1\})c \to c.$$

Hence we can choose $m_n \to \infty$ so that $\liminf_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) > 0$. Therefore
$(\mathcal{M}_n; n = 1, 2, \ldots)$ is not consistent to $(\Pi_n; n = 1, 2, \ldots)$.

3.2 Sufficient conditions for consistency of non-random Markov chain Monte Carlo procedure

Let $(\Theta, d)$ be a complete separable metric space equipped with Borel $\sigma$-algebra
$\Xi$. In this subsection, we assume

$$(\Theta_n, d_n) = (\Theta, d), \quad (S_n, \mathcal{S}_n) = (\Theta, \Xi), \quad d(s, t) \le 1\ (s, t \in \Theta). \tag{3.1}$$

Note that the assumption $d(s, t) \le 1$ is just for simplicity and all results in this
paper are valid without this assumption.

Write $\Theta^{\mathbb{N}_0}$ for a countable product of $\Theta$. We use the notation $\theta_\infty = (\theta(0), \theta(1), \ldots) \in
\Theta^{\mathbb{N}_0}$. We introduce a metric by

$$d_\infty(\theta^1_\infty, \theta^2_\infty) = \sum_{i=0}^{\infty} 2^{-i-1}d(\theta^1(i), \theta^2(i)) \tag{3.2}$$

where $\theta^i_\infty = (\theta^i(0), \theta^i(1), \ldots) \in \Theta^{\mathbb{N}_0}$ $(i = 1, 2)$. We write
$\theta_m = (\theta(0), \theta(1), \ldots, \theta(m-1)) \in \Theta^m$ for a subsequence of $\theta_\infty$, introducing a metric $d_m$ such
that $d_m(\theta^1_m, \theta^2_m)$ is the same as the right hand side of (3.2) replacing "$\infty$" by
"$m-1$", where $\theta^i_m$ is the subsequence of $\theta^i_\infty$ consisting of its first $m$ elements. Let $w_\infty$ and $w_m$
be the bounded Lipschitz metrics for $\mathcal{P}(\Theta^{\mathbb{N}_0})$ and $\mathcal{P}(\Theta^m)$ defined by $d_\infty$ and $d_m$,
respectively. The next two propositions are fundamental results for the
consistency of Monte Carlo procedures.

Proposition 3.4. Let $\mathcal{M}_n = (M_n, e)$ be a non-random stationary Monte Carlo
procedure with invariant distribution $\Pi_n$ for $n = 1, 2, \ldots, \infty$. Moreover, assume that $M_\infty$
is ergodic and $e$ is a sequence of empirical distribution. If $w_\infty(M_n, M_\infty) \to 0$,
then $(\mathcal{M}_n; n = 1, 2, \ldots)$ is consistent to $(\Pi_n; n = 1, 2, \ldots)$.

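Proof. First we show that for a stationary Monte Carlo procedure $\mathcal{M} = (M, e)$,
and for $k < m$,

$$R_m(\mathcal{M}, \Pi) \le R_k(\mathcal{M}, \Pi) + \frac{k}{m}. \tag{3.3}$$

Write $\Pi(\psi)$ for $\int \psi(x)\Pi(dx)$. By definition,

$$w(e_m(\theta_m), \Pi) = \sup_{\psi \in BL_1}\left|\frac{1}{m}\sum_{i=0}^{m-1}\psi(\theta(i)) - \Pi(\psi)\right|.$$

We divide the sequence $\theta_m$ into subsequences of length $k$, that is, divide $\theta_m$ into $\theta_k^j =
(\theta(jk), \theta(jk+1), \ldots, \theta((j+1)k-1))$ $(j = 0, \ldots, [m/k]-1)$ and $\theta(k[m/k]), \ldots, \theta(m-1)$.
Then

$$\frac{1}{m}\sum_{i=0}^{m-1}\psi(\theta(i)) = \frac{k}{m}\sum_{j=0}^{[m/k]-1}\frac{1}{k}\sum_{i=0}^{k-1}\psi(\theta(jk+i)) + \frac{1}{m}\sum_{i=k[m/k]}^{m-1}\psi(\theta(i)).$$

This relation yields

$$w(e_m(\theta_m), \Pi) \le \frac{k}{m}\sum_{j=0}^{[m/k]-1} w(e_k(\theta_k^j), \Pi) + \frac{1}{m}\sum_{i=k[m/k]}^{m-1}\sup_{\psi \in BL_1}|\psi(\theta(i)) - \Pi(\psi)|.$$

For the first term of the right hand side, by stationarity, each $w(e_k(\theta_k^j), \Pi)$ for
$j = 0, \ldots, [m/k]-1$ has the same law under $M$. For the second term, we have
the relation

$$|\psi(\theta) - \Pi(\psi)| \le \int_{\theta^* \in \Theta}|\psi(\theta) - \psi(\theta^*)|\Pi(d\theta^*) \le \int d(\theta, \theta^*)\Pi(d\theta^*) \le 1.$$

Using these relations, we have

$$R_m(\mathcal{M}, \Pi) \le \frac{k}{m}\left[\frac{m}{k}\right]R_k(\mathcal{M}, \Pi) + \frac{m - k[m/k]}{m}.$$

Since $x - 1 < [x] \le x$, (3.3) follows. Applying this result to $\mathcal{M}_n$ and $\Pi_n$, we
have for any $m_n \to \infty$,

$$\limsup_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) \le \limsup_{n \to \infty} R_k(\mathcal{M}_n, \Pi_n) \le \limsup_{n \to \infty} R_k(\mathcal{M}_n, \Pi_\infty)$$

where the second inequality comes from $R_k(\mathcal{M}_n, \Pi_n) \le R_k(\mathcal{M}_n, \Pi_\infty) + w(\Pi_n, \Pi_\infty)$
and $w(\Pi_n, \Pi_\infty) \to 0$ by $w_\infty(M_n, M_\infty) \to 0$.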

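Now we are going to show the continuity of $\theta_\infty \mapsto w(e_k(\theta_\infty), \Pi_\infty)$. If we
have this property, $\limsup_{n \to \infty} R_k(\mathcal{M}_n, \Pi_\infty) = R_k(\mathcal{M}_\infty, \Pi_\infty)$ and it tends to $0$
as $k \to \infty$ by Proposition 2.11. Therefore it is sufficient to show the continuity
for the proof of $R_{m_n}(\mathcal{M}_n, \Pi_n) \to 0$. For $\theta^i_\infty = (\theta^i(0), \ldots)$ $(i = 1, 2)$, by the
triangle inequality,

$$|w(e_k(\theta^1_\infty), \Pi_\infty) - w(e_k(\theta^2_\infty), \Pi_\infty)| \le w(e_k(\theta^1_\infty), e_k(\theta^2_\infty)),$$

which is bounded above by

$$\frac{1}{k}\sum_{i=0}^{k-1} d(\theta^1(i), \theta^2(i)) \le 2^k d_\infty(\theta^1_\infty, \theta^2_\infty).$$

Hence $w(e_k(\theta_\infty), \Pi_\infty)$ is continuous and the claim follows. □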

Let $\mu$ be a probability measure on $\Theta$ and let $K$ be a probability transition
kernel on $\Theta$. Let $\mu \otimes K$ be the probability measure on $\Theta^2$ defined by

$$\mu \otimes K(d\theta, d\theta^*) = \mu(d\theta)K(\theta, d\theta^*).$$

For any probability measures $p, q$ on a metric space $(E, d)$ with Borel $\sigma$-algebra
$\mathcal{E}$, we define the total variation distance by

$$\|p - q\| = \sup_{A \in \mathcal{E}} |p(A) - q(A)| \ge w_E(p, q) \tag{3.4}$$

where $w_E$ is the bounded Lipschitz metric on the space of probability measures
on $E$.

The following lemma due to Le Cam is very useful for our purpose. See
Lemma 12.2.2 of [8] or Lemma 6.4.2 of [9].

Lemma 3.5. Let $(\Theta, d)$ be a separable and complete metric space with Borel $\sigma$-algebra
$\Xi$. Let $\mu_i$ be a probability measure on $\Theta$ and $K_i$ be a probability transition
kernel from $\Theta$ to itself for $i = 1, 2$. Then

$$\int \|K_1(x, \cdot) - K_2(x, \cdot)\|(\mu_1 + \mu_2)(dx) \le 4\|\mu_1 \otimes K_1 - \mu_2 \otimes K_2\|.$$

Using this lemma, the following is just an easy corollary. 

Proposition 3.6. Let $\mathcal{M}^i_n = (M^i_n, e^i_n)$ be a non-random stationary Markov chain
Monte Carlo procedure with invariant probability distribution $\Pi^i_n$ for $i = 1, 2$
and $n = 1, 2, \ldots$. Write $K^i_n$ for the probability transition kernel of $M^i_n$. If
$\|\Pi^1_n \otimes K^1_n - \Pi^2_n \otimes K^2_n\| \to 0$, then $w_\infty(M^1_n, M^2_n) \to 0$.

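Proof. Write $M^i_{n,m}$ for the restriction of $M^i_n$ to $(\Theta^m, \Xi^m)$, that is,

$$M^i_{n,m}(d\theta_m) = \Pi^i_n(d\theta(0))\prod_{j=0}^{m-2} K^i_n(\theta(j), d\theta(j+1)) \quad (\theta_m = (\theta(0), \ldots, \theta(m-1))).$$

Recall that $w_m$ is the bounded Lipschitz metric for the space of probability
measures on $\Theta^m$. Write $(\theta_m, 0) \in \Theta^{\mathbb{N}_0}$ for $(\theta(0), \ldots, \theta(m-1), 0, 0, \ldots)$ where $0$ means
just a fixed element of $\Theta$. By definition

$$w_\infty(M^1_n, M^2_n) = \sup_{\psi \in BL_1(\Theta^{\mathbb{N}_0})}\left|\int \psi(\theta_\infty)M^1_n(d\theta_\infty) - \int \psi(\theta_\infty)M^2_n(d\theta_\infty)\right|$$

and by taking $\psi(\theta_\infty) = \psi(\theta_m, 0) + (\psi(\theta_\infty) - \psi(\theta_m, 0))$ and by $|\psi(\theta_\infty) - \psi(\theta_m, 0)| \le
d_\infty(\theta_\infty, (\theta_m, 0)) \le 2^{-m}$, the above is bounded above by

$$w_m(M^1_{n,m}, M^2_{n,m}) + 2 \cdot 2^{-m}.$$

Therefore, to show $w_\infty(M^1_n, M^2_n) \to 0$, it is sufficient to show $w_m(M^1_{n,m}, M^2_{n,m}) \to 0$
for any $m = 1, 2, \ldots$. In fact, we can show $\|M^1_{n,m} - M^2_{n,m}\| \to 0$ for any
$m = 1, 2, \ldots$, which is stronger than $w_m(M^1_{n,m}, M^2_{n,m}) \to 0$ by (3.4).

The convergence holds for $m = 1, 2$ by assumption. Now assume that the
convergence is true for any $m = 1, 2, \ldots, k$. For $m = k + 1$, observe that
$(M^1_{n,k+1} - M^2_{n,k+1})(d\theta_{k+1})$ equals

$$(M^1_{n,k} - M^2_{n,k})(d\theta_k)K^1_n(\theta(k-1), d\theta(k)) + M^2_{n,k}(d\theta_k)(K^1_n - K^2_n)(\theta(k-1), d\theta(k)).$$

Then $\|M^1_{n,k+1} - M^2_{n,k+1}\|$ is bounded above by

$$\|M^1_{n,k} - M^2_{n,k}\| + \int M^2_{n,k}(d\theta_k)\|(K^1_n - K^2_n)(\theta(k-1), \cdot)\|.$$

The former tends to $0$ by the assumption of the induction. Since $(\Theta, d)$ is
separable and complete, and by stationarity, the second term equals

$$\int \Pi^2_n(d\theta)\|(K^1_n - K^2_n)(\theta, \cdot)\| \le 4\|\Pi^1_n \otimes K^1_n - \Pi^2_n \otimes K^2_n\| \to 0$$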

by Lemma 3.5. □ 
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3.3 Localization 

In this section, we consider localization of non-random Monte Carlo procedure. 
The following example illustrates its motivation. 

Example 3.7. Let $\Theta_n = S_n = \mathbf{R}$ and $\Pi_n = N(0, n^{-1})$. For $n = 1, 2, \ldots$ let $\mathcal{M}_{1,n}$ and $\mathcal{M}_{2,n}$ be non-random crude Monte Carlo procedures corresponding to

$$Q_{1,n} = \delta_{\{0\}}, \qquad Q_{2,n} = N(n^{-1}, n^{-1}),$$

respectively, each with the sequence of empirical distributions $e$. In the comment after Proposition 3.9, we will show that both $(\mathcal{M}_{1,n}; n = 1, 2, \ldots)$ and $(\mathcal{M}_{2,n}; n = 1, 2, \ldots)$ are consistent to $(\Pi_n; n = 1, 2, \ldots)$. However, the latter seems preferable to the former.

Make a projection $\psi : \theta \mapsto n^{1/2}\theta$. Then the probability measures become $\Pi_n^* = \Pi^* = N(0, 1)$ and

$$Q^*_{1,n} = \delta_{\{0\}}, \qquad Q^*_{2,n} = N(n^{-1/2}, 1).$$

Let $\mathcal{M}^*_{1,n}$ and $\mathcal{M}^*_{2,n}$ be the corresponding non-random crude Monte Carlo procedures. Then $(\mathcal{M}^*_{1,n}; n = 1, 2, \ldots)$ is not consistent to $(\Pi^*_n; n = 1, 2, \ldots)$ since

$$R_m(\mathcal{M}^*_{1,n}, \Pi^*_n) = w(\delta_0, N(0, 1)) > 0 \quad (m, n = 1, 2, \ldots).$$

On the other hand, if we write the non-random crude Monte Carlo procedure for $\Pi^*$ with $e$ as $\mathcal{M}^*_0$, then

$$R_{m_n}(\mathcal{M}^*_{2,n}, \Pi^*_n) = R_{m_n}(\mathcal{M}^*_{2,n}, N(n^{-1/2}, 1)) + o(1) = R_{m_n}(\mathcal{M}^*_0, \Pi^*) + o(1) = o(1)$$

as $n \to \infty$ by Proposition 2.11. Hence $(\mathcal{M}^*_{2,n}; n = 1, 2, \ldots)$ is consistent to $(\Pi^*_n; n = 1, 2, \ldots)$ although $(\mathcal{M}^*_{1,n}; n = 1, 2, \ldots)$ is not. In this sense, $\mathcal{M}_{2,n}$ is preferable.
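The contrast in Example 3.7 can be checked numerically. The following is a minimal sketch, not part of the original argument. It assumes that the bounded Lipschitz metric is $w(\mu, \nu) = \sup_f |\int f\,d\mu - \int f\,d\nu|$ over functions $f$ whose sup-norm and Lipschitz constant sum to at most 1. The test function $f(x) = \min(|x|, 1)/2$ and the helper names are illustrative choices. The sketch gives an upper bound on $w(\delta_0, N(0, \sigma^2))$ that vanishes with $\sigma = n^{-1/2}$, and a lower bound at $\sigma = 1$ that stays positive.

```python
import math


def std_normal_pdf(z):
    """Density of N(0, 1) at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Distribution function of N(0, 1) at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bl_upper_bound(sigma):
    """Upper bound on w(delta_0, N(0, sigma^2)).

    Every admissible f is 1-Lipschitz, so |f(0) - E f(X)| <= E|X|,
    and E|X| = sigma * sqrt(2 / pi) for X ~ N(0, sigma^2).
    """
    return sigma * math.sqrt(2.0 / math.pi)


def bl_lower_bound(sigma):
    """Lower bound on w(delta_0, N(0, sigma^2)) from f(x) = min(|x|, 1) / 2.

    f(0) = 0, so the gap equals E f(sigma * W) with W ~ N(0, 1), and
    E min(sigma|W|, 1) = 2 sigma (phi(0) - phi(1/sigma)) + 2 (1 - Phi(1/sigma)).
    """
    a = 1.0 / sigma
    expected_min = (2.0 * sigma * (std_normal_pdf(0.0) - std_normal_pdf(a))
                    + 2.0 * (1.0 - std_normal_cdf(a)))
    return 0.5 * expected_min


if __name__ == "__main__":
    for n in (1, 100, 10000):
        # Unscaled: delta_0 against Pi_n = N(0, 1/n); localized: delta_0 against N(0, 1).
        print(n, bl_upper_bound(n ** -0.5), bl_lower_bound(1.0))
```

The first printed column of bounds shrinks at rate $n^{-1/2}$, while the localized lower bound does not depend on $n$, which is the sense in which $\mathcal{M}_{1,n}$ is consistent but not locally consistent.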

As the above example shows, $(\mathcal{M}_{2,n}; n = 1, 2, \ldots)$ has a better property than $(\mathcal{M}_{1,n}; n = 1, 2, \ldots)$. We will call $(\mathcal{M}_{2,n}; n = 1, 2, \ldots)$ locally consistent to $(\Pi_n; n = 1, 2, \ldots)$. We now make a formal definition.

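Assume $\Theta_n = \Theta \subset \mathbf{R}^p$ and let $d_n = d$ be the usual metric on $\mathbf{R}^p$. Let $\theta_n \in \Theta$ and $\delta_n > 0$ such that $\delta_n \to 0$. Let

$$\psi_n : \theta \mapsto \delta_n^{-1}(\theta - \theta_n).$$

Let $\mathcal{M}_n = (M_n, e_n)$ be a non-random Monte Carlo procedure. For a probability measure $Q$ on $\Theta$, we define a localization $Q^*$ by $Q^*(A) = Q(\psi_n^{-1}(A)) = Q(\theta_n + \delta_n A)$. Let $\Pi_n^*$ and $e^*_{n,m}(s_\infty)$ be the localizations of $\Pi_n$ and $e_{n,m}(s_\infty)$, respectively. Then $\mathcal{M}^*_n := (M_n, e^*_n)$, where $e^*_n = (e^*_{n,m}; m = 1, 2, \ldots)$, is a non-random Monte Carlo procedure.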

Definition 3.8 (Local consistency). $(\mathcal{M}_n; n = 1, 2, \ldots)$ is said to be locally consistent to $(\Pi_n; n = 1, 2, \ldots)$ if $(\mathcal{M}^*_n; n = 1, 2, \ldots)$ is consistent to $(\Pi^*_n; n = 1, 2, \ldots)$.

As in the proof of the following proposition, local consistency implies consistency. Moreover, if $(\Pi^*_n; n = 1, 2, \ldots)$ is tight, it is consistent to a point mass.

Proposition 3.9. Assume $(\mathcal{M}_n; n = 1, 2, \ldots)$ is locally consistent to $(\Pi_n; n = 1, 2, \ldots)$ and $(\Pi^*_n; n = 1, 2, \ldots)$ is tight, that is, for any $\epsilon > 0$, there exists a compact set $K$ such that $\limsup_{n \to \infty} \Pi^*_n(K^c) < \epsilon$. Then $(\mathcal{M}_n; n = 1, 2, \ldots)$ is consistent to $(\tilde\Pi_n := \delta_{\theta_n}; n = 1, 2, \ldots)$.

Proof. By tightness of $\Pi^*_n$, $w_\Theta(\Pi_n, \tilde\Pi_n) \to 0$. Write $\Theta^*_n = \psi_n(\Theta)$. Since

V^eBL 1 (e)^^; 1 ^eBL 1 (e* t ), 

we have $w_\Theta(e_m, \Pi_n) = \delta_n w_{\Theta^*_n}(e^*_m, \Pi^*_n)$. Hence by local consistency and $\delta_n \to 0$, we have $R_{m_n}(\mathcal{M}_n, \Pi_n) \to 0$ for any $m_n \to \infty$. Therefore by the triangle inequality, $R_{m_n}(\mathcal{M}_n, \tilde\Pi_n) \le R_{m_n}(\mathcal{M}_n, \Pi_n) + w(\Pi_n, \tilde\Pi_n) \to 0$. □

In Example 3.7, $(\mathcal{M}_{1,n}; n = 1, 2, \ldots)$ is consistent to $(\Pi_n; n = 1, 2, \ldots)$ since

$$R_{m_n}(\mathcal{M}_{1,n}, \Pi_n) = w(Q_{1,n}, \Pi_n) \to 0.$$

Consistency of $(\mathcal{M}_{2,n}; n = 1, 2, \ldots)$ comes from local consistency of $(\mathcal{M}_{2,n}; n = 1, 2, \ldots)$ by Proposition 3.9.

Remark 3.10. For the study of non-regular behavior of Monte Carlo procedures, other localizations are more natural in some cases. However, in the current study we use only the above localization and do not pursue other possible scalings.

4 Random Monte Carlo procedure 

In this section, we consider random Monte Carlo procedures instead of non-random Monte Carlo procedures. The convergence properties of the Gibbs sampler will be considered in this framework. Consistency and local consistency are defined as good properties of Monte Carlo procedures.

4.1 Definitions of Monte Carlo and Markov chain Monte 
Carlo procedure 

Let $(X, \mathcal{X}, P)$ be a probability space, $(S, \mathcal{S})$ be a measurable space and $(\Theta, d)$ be a complete separable metric space equipped with its Borel $\sigma$-algebra $\Xi$. Let $S^{\mathbf{N}_0}$ be a countable product of $S$ and let $\mathcal{S}^{\otimes \mathbf{N}_0}$ be its $\sigma$-algebra. Write an element of $S^{\mathbf{N}_0}$ as $s_\infty = (s(0), s(1), \ldots)$ and $s_m = (s(0), s(1), \ldots, s(m-1)) \in S^m$.

Remark 4.1. In general, $(S, \mathcal{S})$ and $(\Theta, d)$ may depend on the element $x \in X$, as $(S_x, \mathcal{S}_x)$ and $(\Theta_x, d)$. Although this dependence is used implicitly, it is not important in this paper. We omit it and simply write $(S, \mathcal{S})$ and $(\Theta, d)$ as above.
Definition 4.2 (Monte Carlo procedure). Let $M$ be a probability transition kernel from $X$ to $S^{\mathbf{N}_0}$, that is

1. $M(x, \cdot)$ is a probability measure on $(S^{\mathbf{N}_0}, \mathcal{S}^{\otimes \mathbf{N}_0})$ for any $x \in X$.

2. $M(\cdot, A_\infty)$ is $\mathcal{X}$-measurable for any $A_\infty \in \mathcal{S}^{\otimes \mathbf{N}_0}$.

Let $e_m$ be a probability transition kernel from $X \times S^m$ to $\Theta$ for $m = 1, 2, \ldots$ and $e = (e_m; m = 1, 2, \ldots)$. We call $\mathcal{M} = (M, e)$ a Monte Carlo procedure defined on $(X, \mathcal{X}, P)$ on $(S, \Theta)$, or simply, a Monte Carlo procedure.

If $S = \Theta$, we call $\mathcal{M}$ a Monte Carlo procedure defined on $(X, \mathcal{X}, P)$ on $\Theta$. As for non-random Monte Carlo procedures, we write $e_m(x, s_\infty)$ or $e_m(x, s_m)$ instead of $e_m(x, s_m, \cdot)$, and we also write $e_m(s_\infty)$ or $e_m(s_m)$ if it does not depend on $x$. When $S = \Theta$ and $e_m(x, s_m) = m^{-1} \sum_{i=0}^{m-1} \delta_{s(i)}$, we call $e = (e_m; m = 1, 2, \ldots)$ a sequence of empirical distributions.
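As a small illustration of the empirical distribution $e_m(s_m) = m^{-1}\sum_{i=0}^{m-1}\delta_{s(i)}$, the sketch below evaluates it on a set and against a test function. The function names are hypothetical and are used only for this illustration.

```python
def empirical_measure(draws, event):
    """Return e_m(s_m)(A) = m^{-1} * #{i : s(i) in A}.

    `draws` is the finite path s_m = (s(0), ..., s(m-1)) and `event` is the
    indicator of the set A, given as a function returning True or False.
    """
    m = len(draws)
    if m == 0:
        raise ValueError("at least one draw is required")
    return sum(1 for s in draws if event(s)) / m


def empirical_integral(draws, f):
    """Return the integral of f with respect to e_m(s_m), i.e. the sample mean of f."""
    m = len(draws)
    if m == 0:
        raise ValueError("at least one draw is required")
    return sum(f(s) for s in draws) / m
```

For instance, `empirical_measure([0.2, 1.5, -0.3, 0.9], lambda s: 0 <= s <= 1)` counts the two draws in $[0, 1]$ out of four.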

Let $\mu$ and $K$ be probability transition kernels from $X$ to $S$ and from $X \times S$ to $S$, respectively. Let $M$ be a probability transition kernel from $X$ to $S^{\mathbf{N}_0}$. When $M(x, \cdot)$ is a Markov measure with initial distribution $\mu(x, \cdot)$ and probability transition kernel $K(x, \cdot, \cdot)$, we call $M$ a random Markov measure generated by $(\mu, K)$.



Definition 4.3 (Markov chain Monte Carlo procedure). When $\mathcal{M} = (M, e)$ is a Monte Carlo procedure and $M$ is a random Markov measure, $\mathcal{M}$ is called a Markov chain Monte Carlo procedure.

Example 4.4. Let $(X, \mathcal{X}, P)$ be a probability space and $(Y, \mathcal{Y})$ be a measurable space and set $S = Y \times \Theta$. Let $P(d\theta|x, y)$ and $P(dy|x, \theta)$ be probability transition kernels from $X \times Y$ to $\Theta$ and from $X \times \Theta$ to $Y$. When a random Markov measure $M$ is constructed by a probability transition kernel $K$ defined by

$$K(x, (y, \theta), d(y^*, \theta^*)) = P(dy^*|x, \theta) P(d\theta^*|x, y^*),$$

then the Markov chain Monte Carlo procedure $\mathcal{M} = (M, e)$ is called a Gibbs sampler.

Definition 4.5. A Markov chain Monte Carlo procedure $\mathcal{M} = (M, e)$ and $M$ are called stationary or ergodic if $M(x, \cdot)$ is stationary or ergodic for $P$-a.e. $x \in X$.

If $M$ is stationary, for $A \in \mathcal{S}$, let $\Pi(x, A) := M(x, \{s_\infty; s(0) \in A\})$. The probability transition kernel $\Pi$ is called an invariant probability transition kernel for $K$, $M$ and $\mathcal{M}$.

4.2 Consistency of Markov chain Monte Carlo procedure

Let $\mathcal{M} = (M, e)$ be a Monte Carlo procedure defined on $(X, \mathcal{X}, P)$ on $(S, \Theta)$ and let $\Pi$ be a probability transition kernel from $X$ to $\Theta$. Let $w$ be a bounded Lipschitz metric on $\mathcal{P}(\Theta)$. Let

$$W_m(\mathcal{M}(x), \Pi(x)) = \int w(e_m(x, s_\infty), \Pi(x, \cdot)) M(x, ds_\infty)$$

and

$$R_m(\mathcal{M}, \Pi) = \int W_m(\mathcal{M}(x), \Pi(x)) P(dx).$$

It is natural to extend the definition of consistency for non-random Monte Carlo procedures to random Monte Carlo procedures as follows.

Definition 4.6 (Consistency). A Monte Carlo procedure $\mathcal{M} = (M, e)$ defined on $(X, \mathcal{X}, P)$ on $(S, \Theta)$ is called consistent to a probability transition kernel $\Pi$ from $X$ to $\Theta$ if $R_m(\mathcal{M}, \Pi) \to 0$ for $m \to \infty$.

Now we consider a sequence of Monte Carlo procedures. Let $(X_n, \mathcal{X}_n, P_n)$ be a probability space, $(\Theta_n, d_n)$ be a complete and separable metric space and $(S_n, \mathcal{S}_n)$ be a measurable space for each $n = 1, 2, \ldots$. Let $w_n$ be a bounded Lipschitz metric on $\mathcal{P}(\Theta_n)$ defined by the metric $d_n$. For a Monte Carlo procedure $\mathcal{M}_n = (M_n, e_n)$ with $e_n = (e_{n,m}; m = 1, 2, \ldots)$, and a probability transition kernel $\Pi_n$ from $X_n$ to $\Theta_n$, we define

$$W_m(\mathcal{M}_n(x_n), \Pi_n(x_n)) = \int w_n(e_{n,m}(x_n, s_\infty), \Pi_n(x_n, \cdot)) M_n(x_n, ds_\infty)$$

and $R_m(\mathcal{M}_n, \Pi_n) = \int W_m(\mathcal{M}_n(x_n), \Pi_n(x_n)) P_n(dx_n)$.

Definition 4.7 (Consistency for random sequence). A sequence of Monte Carlo procedures $\mathcal{M}_n = (M_n, e_n)$ defined on $(X_n, \mathcal{X}_n, P_n)$ on $(S_n, \Theta_n)$ for $n = 1, 2, \ldots$ is called consistent to a sequence of probability transition kernels $\Pi_n$ from $X_n$ to $\Theta_n$ for $n = 1, 2, \ldots$ if $\lim_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) = 0$ for any $m_n \to \infty$.

This definition requires more than $\lim_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) = 0$ for some particular $m_n \to \infty$: the convergence must hold for any $m_n \to \infty$. For example, for a natural Gibbs sampler for a simple binomial model (with the scaling defined later), the convergence holds for any $m_n$ such that $m_n/n \to \infty$. However, we cannot take $m_n = \log(n)$, and the performance of this Gibbs sampler is very poor in simulation. In this sense, the requirement "for any $m_n \to \infty$" is important. This slow convergence property is called weak consistency, and it will be studied in a separate paper. Fortunately, under regularity conditions, the Gibbs sampler is consistent under the scaling defined later.

Now we state sufficient conditions for consistency of Markov chain Monte Carlo methods. We can generalize the results for a non-random sequence of Markov chain Monte Carlo procedures to a random sequence. We assume (3.1). Recall that $w_\infty$ is a bounded Lipschitz metric on $\mathcal{P}(\Theta^{\mathbf{N}_0})$.

Proposition 4.8. Let $\mathcal{M}_n = (M_n, e)$ be a sequence of stationary Monte Carlo procedures defined on $(X_n, \mathcal{X}_n, P_n)$ on $\Theta$ with the sequence of empirical distributions $e$ for $n = 1, 2, \ldots$. Let $\Pi_n$ be a probability transition kernel from $X_n$ to $\Theta$ which is the invariant probability transition kernel of $\mathcal{M}_n$ for $n = 1, 2, \ldots$. Let $\mathcal{M}_\infty = (M_\infty, e)$ be an ergodic, stationary non-random Monte Carlo procedure with invariant probability measure $\Pi_\infty$. If




oo 



),TL n (x))M n (x n ,ds 

CO j 




(4.1) 



then $(\mathcal{M}_n; n = 1, 2, \ldots)$ is consistent to $(\Pi_n; n = 1, 2, \ldots)$.
Proof. This is just a direct application of the non-random case. By (3.3),

k 

),U n (x n )) h 

and by the triangle inequality,

$$W_k(\mathcal{M}_n(x_n), \Pi_n(x_n)) \le W_k(\mathcal{M}_n(x_n), \Pi_\infty) + w(\Pi_\infty, \Pi_n(x_n)).$$

By (4.1), $\int w(\Pi_\infty, \Pi_n(x_n)) P_n(dx_n) \to 0$. Therefore

$$\limsup_{n \to \infty} R_{m_n}(\mathcal{M}_n, \Pi_n) \le \limsup_{n \to \infty} R_k(\mathcal{M}_n, \Pi_\infty).$$

Next we show two convergence properties

$$\lim_{n \to \infty} R_k(\mathcal{M}_n, \Pi_\infty) = R_k(\mathcal{M}_\infty, \Pi_\infty), \qquad \lim_{k \to \infty} R_k(\mathcal{M}_\infty, \Pi_\infty) = 0. \tag{4.2}$$

Define a non-random Monte Carlo procedure $\bar{\mathcal{M}}_n = (\bar M_n, e)$ on $\Theta$ by

$$\bar M_n(ds_\infty) = \int_{x_n \in X_n} P_n(dx_n) M_n(x_n, ds_\infty).$$

Then $R_k(\mathcal{M}_n, \Pi_\infty) = R_k(\bar{\mathcal{M}}_n, \Pi_\infty)$ and (4.2) becomes convergence of non-random Monte Carlo procedures. Since (4.1) implies $w_\infty(\bar M_n, M_\infty) \to 0$, the claim follows by Proposition 3.4. □

Let $M$ be a random Markov measure defined on $(X, \mathcal{X}, P)$ generated by $\mu(x, d\theta)$ and $K(x, \theta, d\theta^*)$. Then we define

$$(\mu \otimes K)(x, d\theta, d\theta^*) = \mu(x, d\theta) K(x, \theta, d\theta^*).$$

The proof of the following proposition is exactly the same as in the non-random case, so we omit it.

Proposition 4.9. Let stationary Markov chain Monte Carlo procedures $\mathcal{M}^i_n = (M^i_n, e^i_n)$ be defined on $(X_n, \mathcal{X}_n, P_n)$ on $\Theta$ with invariant probability transition kernels $\Pi^i_n$ for $i = 1, 2$ and $n = 1, 2, \ldots$. If

$$\lim_{n \to \infty} \int \|\Pi^1_n \otimes K^1_n(x_n, \cdot) - \Pi^2_n \otimes K^2_n(x_n, \cdot)\| P_n(dx_n) = 0,$$

then $\int w_\infty(M^1_n(x_n), M^2_n(x_n)) P_n(dx_n) \to 0$.

As in the non-random case, we define equivalence for random Monte Carlo procedures.

Definition 4.10 (Equivalence). Let $(X, \mathcal{X}, P)$ be a probability space, let $(\Theta, d)$ be a metric space with Borel $\sigma$-algebra $\Xi$, and let $(S^i, \mathcal{S}^i)$ be measurable spaces for $i = 1, 2$. Let $\mathcal{M}^i = (M^i, e^i)$ be Monte Carlo procedures defined on $(X, \mathcal{X}, P)$ on $(S^i, \Theta)$ for $i = 1, 2$. Then $\mathcal{M}^1$ and $\mathcal{M}^2$ are called equivalent if

$$R_m(\mathcal{M}^1, \Pi) = R_m(\mathcal{M}^2, \Pi)$$

for any $m \in \mathbf{N}$ and any probability transition kernel $\Pi$ from $X$ to $\Theta$.






4.3 Localization and non-stationarity 

In this subsection, we consider two topics: localization and non-stationarity of Monte Carlo procedures. First we define localization for random procedures. For a random Monte Carlo procedure, the localization is also random. Assume $\Theta_n = \Theta \subset \mathbf{R}^p$ and $d_n = d$ is the usual metric on $\mathbf{R}^p$. Let $\theta_n : X_n \to \Theta$ be an $\mathcal{X}_n$-measurable map and $\delta_n > 0$ such that $\delta_n \to 0$. Let

$$\psi_n : \theta \mapsto \delta_n^{-1}(\theta - \theta_n).$$

Let $\mathcal{M}_n = (M_n, e_n)$ be a Monte Carlo procedure. For a probability transition kernel $Q$ from $X_n$ to $\Theta$, let $Q^* = Q^*_n$ be the localization defined by $Q^*(x_n, A) = Q(x_n, \theta_n(x_n) + \delta_n A)$ for a Borel set $A$. Let $\Pi^*_n$ and $e^*_{n,m}(x_n, s_m, \cdot)$ be the localizations defined by $\Pi^*_n(x_n, A) = \Pi_n(x_n, \theta_n(x_n) + \delta_n A)$ and $e^*_{n,m}(x_n, s_m, A) = e_{n,m}(x_n, s_m, \theta_n(x_n) + \delta_n A)$. Then $\mathcal{M}^*_n := (M_n, e^*_n)$, where $e^*_n = (e^*_{n,m}; m = 1, 2, \ldots)$, is a Monte Carlo procedure. When $S_n = \Theta$, we may apply the localization to $M_n$ rather than to $e_n$ and set $\mathcal{N}^*_n = (M^*_n, e_n)$ by taking $M^*_n(x_n, A_\infty) = M_n(x_n, \theta_n(x_n) + \delta_n A_\infty)$, where $\delta A_\infty = \{(\delta s(0), \delta s(1), \ldots); (s(0), s(1), \ldots) \in A_\infty\}$.

These two localizations $\mathcal{M}^*_n$ and $\mathcal{N}^*_n$ are equivalent.

Definition 4.11. $(\mathcal{M}_n; n = 1, 2, \ldots)$ is said to be locally consistent to $(\Pi_n; n = 1, 2, \ldots)$ if $(\mathcal{M}^*_n; n = 1, 2, \ldots)$ is consistent to $(\Pi^*_n; n = 1, 2, \ldots)$.

Second, we consider non-stationarity. In other parts of the paper, the Markov chain Monte Carlo procedure is assumed to be stationary, which is an unrealistic assumption. The choice of the initial probability transition kernel $\mu_n(x_n, d\theta)$ is an important part of designing a Monte Carlo method. This choice depends heavily on the structure of the model, which makes a general framework difficult. The following proposition gives one basic approach to the choice of $\mu_n$.

For $\epsilon > 0$, when two $\sigma$-finite measures $\mu$ and $\nu$ on $(E, \mathcal{E})$ satisfy $\mu(A) \le \nu(A) + \epsilon$ for any $A \in \mathcal{E}$, we write $\mu \le \nu + \epsilon$.
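On a finite set the relation $\mu \le c\nu + \epsilon$ can be checked directly, because $\sup_A (\mu(A) - c\nu(A))$ is attained on $A = \{s; \mu(\{s\}) > c\nu(\{s\})\}$ and equals the sum of the positive parts. The following sketch illustrates this for the discrete case only; the function names are hypothetical, and measures are represented as dictionaries of point masses.

```python
from fractions import Fraction


def smallest_epsilon(mu, nu, c):
    """Return sup_A (mu(A) - c * nu(A)) for measures on a finite set.

    The supremum is the sum over points of max(mu({s}) - c * nu({s}), 0).
    """
    states = set(mu) | set(nu)
    return sum(max(mu.get(s, 0) - c * nu.get(s, 0), 0) for s in states)


def is_dominated(mu, nu, c, eps):
    """Check mu(A) <= c * nu(A) + eps for every subset A of the finite space."""
    return smallest_epsilon(mu, nu, c) <= eps


if __name__ == "__main__":
    mu = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    nu = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
    print(smallest_epsilon(mu, nu, 1), smallest_epsilon(mu, nu, 2))
```

Increasing the constant $c$ lowers the smallest admissible $\epsilon$, which is the trade-off used in condition (4.3).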

Proposition 4.12. Let $\mathcal{M}_n = (M_n, e_n)$ be a stationary Markov chain Monte Carlo procedure where $M_n$ is generated by $(\Pi_n, K_n)$. Let $\mathcal{N}_n = (N_n, e_n)$ be another Markov chain Monte Carlo procedure where $N_n$ is generated by $(\mu_n, K_n)$ for $\mu_n \ne \Pi_n$. Suppose that for any $\epsilon > 0$, there exists $c > 0$ such that

$$\limsup_{n \to \infty} P_n(\{x_n; \mu_n(x_n, \cdot) \le c\Pi_n(x_n, \cdot) + \epsilon\}^c) < \epsilon. \tag{4.3}$$

Then if $(\mathcal{M}_n; n = 1, 2, \ldots)$ is consistent to $(\Pi_n; n = 1, 2, \ldots)$, $(\mathcal{N}_n; n = 1, 2, \ldots)$ is also consistent to $(\Pi_n; n = 1, 2, \ldots)$.

Proof. Take $A^\epsilon_n = \{x_n; \mu_n(x_n, \cdot) \le c\Pi_n(x_n, \cdot) + \epsilon\}$. If $x_n \in A^{\epsilon/2}_n$,

$$W_m(\mathcal{N}_n(x_n), \Pi_n(x_n)) \le c W_m(\mathcal{M}_n(x_n), \Pi_n(x_n)) + \epsilon/2.$$

Hence

$$R_m(\mathcal{N}_n, \Pi_n) \le P_n((A^{\epsilon/2}_n)^c) + c R_m(\mathcal{M}_n, \Pi_n) + \epsilon/2,$$

and taking $n \to \infty$, we have $\limsup_{n \to \infty} R_{m_n}(\mathcal{N}_n, \Pi_n) < \epsilon$. Hence the claim follows. □

The meaning of the above proposition becomes clear when we make a localization. It says that, with certain regularity of the model and of the Markov chain Monte Carlo procedure, it is sufficient to find an $n^{1/2}$-consistent estimator of $\theta_n$ to construct a consistent Markov chain Monte Carlo procedure. We illustrate it in the following example.

Example 4.13. Let $\Theta = \mathbf{R}^p$ and $(X_n, \mathcal{X}_n, P_n)$ be a probability space. We prepare some assumptions.

1. $\theta_n : X_n \to \Theta$ is $P_n$-tight. That is, for any $\epsilon > 0$, there exists a compact set $K$ such that $\limsup_{n \to \infty} P_n(\{\theta_n(x_n) \notin K\}) < \epsilon$.

2. $\mathcal{M}_n = (M_n, e_n)$ is a stationary Markov chain Monte Carlo procedure where $M_n$ is generated by $(\Pi_n, K_n)$ for $n = 1, 2, \ldots$. $(\mathcal{M}_n; n = 1, 2, \ldots)$ is locally consistent to $(\Pi_n; n = 1, 2, \ldots)$ under the map $\psi_n(x_n) : \theta \mapsto n^{1/2}(\theta - \theta_n)$.

3. $I(\theta)$ is a $p \times p$ positive definite symmetric matrix. It is continuous in $\theta$, that is, for $I(\theta) = (I_{ij}(\theta); i, j = 1, \ldots, p)$,

$$\lim_{n \to \infty} \sum_{i,j} |I_{ij}(\theta_n) - I_{ij}(\theta)| = 0 \tag{4.4}$$

if $\theta_n \to \theta$.

4. The transition probability kernel $\Pi_n$ satisfies

$$\lim_{n \to \infty} \int \|\Pi_n(x_n, \cdot) - N(\theta_n, n^{-1} I(\theta_n)^{-1})\| P_n(dx_n) = 0.$$

5. There exists a measurable map $\hat\theta_n : X_n \to \Theta$ such that $r_n := n^{1/2}(\theta_n - \hat\theta_n)$ is $P_n$-tight.

6. $Q$ is a probability measure with density $q$ with respect to the Lebesgue measure. The function $q$ is continuous and strictly positive everywhere.

Take

$$\mu_n(x_n, A) = Q(n^{1/2}(A - \hat\theta_n)).$$

Let $\Pi^*_n$ and $\mu^*_n$ be the localizations of $\Pi_n$ and $\mu_n$ with respect to $\psi_n(x_n)$. Then $\Pi^*_n$ and $\mu^*_n$ satisfy (4.3). The proof will be given below. Then if $N_n$ is a random Markov measure defined by $(\mu_n, K_n)$, $\mathcal{N}_n = (N_n, e_n)$ $(n = 1, 2, \ldots)$ is also locally consistent to $(\Pi_n; n = 1, 2, \ldots)$.

Now we prove (4.3) for $\Pi^*_n$ and $\mu^*_n$. We already know that

$$\lim_{n \to \infty} \int \|\Pi^*_n(x_n, \cdot) - N(0, I(\theta_n)^{-1})\| P_n(dx_n) = 0$$

and for a Borel set $A$ of $\mathbf{R}^p$,

$$\mu^*_n(x_n, A) = \mu_n(x_n, \theta_n + n^{-1/2} A) = Q(A + r_n).$$

Let $B_r := \{\theta \in \mathbf{R}^p; |\theta| < r\}$. For any $\epsilon > 0$, there exists $R > 0$ such that

$$\limsup_{n \to \infty} P_n(\theta_n \notin B_R) < \epsilon/2, \qquad \limsup_{n \to \infty} P_n(r_n \notin B_R) < \epsilon/2, \qquad Q(B_R^c) < \epsilon/2.$$

By continuity and positivity of the probability density functions, there exist constants $c_*, c^*$ such that

$$\inf_{\theta \in B_R,\, u \in B_{2R}} \phi(u; 0, I(\theta)^{-1}) > c_* > 0, \qquad \sup_{u \in B_{3R}} q(u) < c^* < \infty.$$

Take $E_n = \{x_n; \theta_n(x_n), r_n(x_n) \in B_R\}$. If $x_n \in E_n$,

$$\mu^*_n(x_n, A) = Q(A + r_n) \le Q((A \cap B_{2R}^c) + r_n) + Q((A \cap B_{2R}) + r_n),$$

and also

$$\mu^*_n(x_n, A) \le Q(B_R^c) + c^* \mathrm{Leb}(A \cap B_{2R}) \le \epsilon/2 + \frac{c^*}{c_*} \int_A \phi(u; 0, I(\theta_n)^{-1}) du.$$

Now we set $F_n = \{x_n; \|N(0, I(\theta_n)^{-1}) - \Pi^*_n(x_n, \cdot)\| < (c_*/c^*)(\epsilon/2)\}$. Then for $x_n \in E_n \cap F_n$,

$$\mu^*_n(x_n, A) \le \epsilon + \frac{c^*}{c_*} \Pi^*_n(x_n, A).$$

Since $\limsup_{n \to \infty} P_n((E_n \cap F_n)^c) < \epsilon$, (4.3) holds for $\Pi^*_n$ and $\mu^*_n$.

Remark 4.14. In this example, $\mu_n$ can be computed from the knowledge of $Q$ and $\hat\theta_n$. For example, if we can construct a $\sqrt{n}$-consistent estimator, and if there exists a locally consistent stationary Markov chain Monte Carlo procedure (which we cannot run in practice), then we can construct a non-stationary Markov chain Monte Carlo procedure starting from the $\sqrt{n}$-consistent estimator with $Q$.
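The initial distribution $\mu_n(x_n, A) = Q(n^{1/2}(A - \hat\theta_n))$ of Example 4.13 is the law of $\hat\theta_n + n^{-1/2}\xi$ with $\xi \sim Q$. The sketch below draws such a starting point. It is an illustration only: the function name and the choice of a standard normal $Q$ in the usage part are assumptions, and any $Q$ with a continuous, strictly positive density would serve.

```python
import math
import random


def draw_initial_value(theta_hat, n, draw_from_q):
    """Draw theta(0) from mu_n(x_n, .) = Q(n^{1/2}(. - theta_hat)).

    theta_hat   : list of floats, an estimator computed from the n observations
    n           : sample size
    draw_from_q : function returning one draw xi from Q as a list of floats
    """
    xi = draw_from_q()
    if len(xi) != len(theta_hat):
        raise ValueError("Q must be a distribution on the same space as theta_hat")
    scale = 1.0 / math.sqrt(n)
    return [t + scale * x for t, x in zip(theta_hat, xi)]


if __name__ == "__main__":
    rng = random.Random(0)
    p = 2
    # Assumption for illustration: Q is the standard normal distribution on R^p.
    start = draw_initial_value([2.0, -1.0], 100, lambda: [rng.gauss(0.0, 1.0) for _ in range(p)])
    print(start)
```

The starting point stays within order $n^{-1/2}$ of the estimator, which matches the scale of the localization $\psi_n$.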

Remark 4.15. In practice we usually do not take a $\sqrt{n}$-consistent estimator as the starting point, as in the above remark. On the other hand, we do not take a starting point far from the "center" of the target distribution, but try to set it close to the center. Choosing a $\sqrt{n}$-consistent estimator is therefore not only a recommendation but also one formalization of this usual practice.

5 Asymptotic statistics and quadratic mean dif- 
ferentiability 

This section provides technical results which will be used later. The reader may skip this section and return to it when a later section requires it.
5.1 Quadratic mean differentiability

Let $(X, \mathcal{X})$ be a measurable space and $\Theta \subset \mathbf{R}^p$ be an open set equipped with Borel $\sigma$-algebra $\Xi$. Let $\nu(dx)$ be a $\sigma$-finite measure on $(X, \mathcal{X})$. Assume also that there exists an $\mathcal{X} \otimes \Xi$-measurable function $p(x|\theta)$ such that

$$P(dx|\theta) = p(x|\theta)\nu(dx) \tag{5.1}$$

where $P(dx|\theta)$ is a probability transition kernel from $\Theta$ to $X$.

Definition 5.1. $P(dx|\theta)$ is called quadratic mean differentiable at $\theta$ if there exists an $\mathbf{R}^p$-valued function $\eta(x|\theta)$ such that for $h \in \mathbf{R}^p$,

$$\int_{x \in X} \left|\sqrt{p(x|\theta + h)} - \sqrt{p(x|\theta)} - h^T \eta(x|\theta)\right|^2 \nu(dx) = o(|h|^2)$$

if $|h| \to 0$.

We call $\eta(x|\theta)$ a quadratic mean derivative of $P(dx|\theta)$ at $\theta$. Let $\tilde\eta(x|\theta) = 2\eta(x|\theta)/\sqrt{p(x|\theta)}$, which is called a score function. Let $Z_n(x_n|\theta) = n^{-1/2} \sum_{i=1}^n \tilde\eta(x^i|\theta)$ for $x_n = (x^1, \ldots, x^n)$. The Fisher information matrix $I(\theta)$ is defined by

$$I(\theta) = 4 \int_{x \in X} \eta(x|\theta)\eta(x|\theta)^T \nu(dx) = \int_{x \in X} \tilde\eta(x|\theta)\tilde\eta(x|\theta)^T P(dx|\theta).$$

Note that $\eta(x|\theta)$ is square integrable with respect to $\nu$ if $P(dx|\theta)$ is quadratic mean differentiable at $\theta$, and hence $I(\theta)$ exists. Quadratic mean differentiability provides a lot of important results with minimal assumptions. For example, if $I(\theta)$ is not singular, the law of $Z_n(x_n|\theta)$ tends to $N(0, I(\theta))$ under $P_n(dx_n|\theta) = \prod_{i=1}^n P(dx^i|\theta)$. See excellent monographs such as [10] and [8]. In this paper, we use the convergence of the posterior distribution, which comes from consistency of the posterior distribution and local asymptotic normality of the likelihood.
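Definition 5.1 can be checked numerically in a simple case. The sketch below uses the normal location model $N(\theta, 1)$ with Lebesgue measure as $\nu$, which is an assumption made for illustration and is not a model taken from this paper. The candidate derivative is $\eta(x|\theta) = \frac{1}{2}(x - \theta)\sqrt{p(x|\theta)}$, and the integrals are approximated by a trapezoid rule on a wide interval.

```python
import math


def sqrt_density(x, theta):
    """Square root of the N(theta, 1) density at x."""
    return (2.0 * math.pi) ** -0.25 * math.exp(-0.25 * (x - theta) ** 2)


def qm_derivative(x, theta):
    """Candidate quadratic mean derivative eta(x|theta) = 0.5 (x - theta) sqrt(p(x|theta))."""
    return 0.5 * (x - theta) * sqrt_density(x, theta)


def integrate(f, lo, hi, steps=6000):
    """Composite trapezoid rule for f on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * width)
    return total * width


def remainder_ratio(theta, h):
    """|h|^{-2} times the integral of (sqrt p(.|theta+h) - sqrt p(.|theta) - h eta)^2."""
    def integrand(x):
        r = sqrt_density(x, theta + h) - sqrt_density(x, theta) - h * qm_derivative(x, theta)
        return r * r
    return integrate(integrand, theta - 15.0, theta + 15.0) / (h * h)


def fisher_information(theta):
    """I(theta) = 4 * integral of eta(x|theta)^2, which equals 1 in this model."""
    return 4.0 * integrate(lambda x: qm_derivative(x, theta) ** 2, theta - 15.0, theta + 15.0)


if __name__ == "__main__":
    for h in (0.5, 0.1, 0.01):
        print(h, remainder_ratio(0.0, h))
    print(fisher_information(0.0))
```

The printed ratios decrease towards 0 as $h$ shrinks, as the $o(|h|^2)$ condition requires, and the Fisher information is numerically close to 1.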

Let $(X_n, \mathcal{X}_n, P_n(dx_n|\theta))$ be the $n$-th product of $(X, \mathcal{X}, P(dx|\theta))$. Let $\Lambda$ be a probability measure on $\Theta$ and $P_n(dx_n) = \int_\Theta P_n(dx_n|\theta)\Lambda(d\theta)$. We assume the existence of a probability transition kernel $P_n(d\theta|x_n)$ from $X_n$ to $\Theta$ such that

$$P_n(d\theta|x_n)P_n(dx_n) = P_n(dx_n|\theta)\Lambda(d\theta).$$

The following set of assumptions is taken from Theorem 10.1 of [18]. See Theorem 8.1.4 of [9] for another useful set of conditions; in particular, see (A-3,4) of their assumptions.

Assumption 5.2.

1. $P(dx|\theta_1) \ne P(dx|\theta_2)$ if $\theta_1 \ne \theta_2$.

2. $I(\theta)$ is non-singular for any $\theta \in \Theta$ and continuous.

3. For any $\theta_0 \in \Theta$ and $\epsilon > 0$, there exists a sequence $\psi_n : X_n \to [0, 1]$ such that for $B_\epsilon = \{\theta; |\theta - \theta_0| < \epsilon\}$

$$\lim_{n \to \infty} \int \psi_n(x_n) P_n(dx_n|\theta_0) = 0, \qquad \lim_{n \to \infty} \sup_{\theta \in B_\epsilon^c} \int (1 - \psi_n(x_n)) P_n(dx_n|\theta) = 0.$$

4. $\Lambda$ has a density $\lambda$ with respect to the Lebesgue measure and $\lambda$ is continuous and positive.

For fixed $\theta \in \Theta$, write $\hat\theta_n = \theta + n^{-1/2} I(\theta)^{-1} Z_n(x_n|\theta)$. Let $\tilde\theta_n$ be a central value of $P_n(d\theta|x_n)$.

Proposition 5.3. Under Assumption 5.2, if $P(dx|\theta)$ is quadratic mean differentiable at any $\theta \in \Theta$, then for any $\theta \in \Theta$, $\sqrt{n}(\hat\theta_n - \tilde\theta_n)$ tends to 0 in $P_n(dx_n|\theta)$-probability. Moreover,

$$\lim_{n \to \infty} \int \|P_n(d\theta|x_n) - N(\tilde\theta_n, n^{-1} I(\tilde\theta_n)^{-1})\| P_n(dx_n) = 0. \tag{5.2}$$

Proof. Fix $\theta \in \Theta$. Consider a probability space $(X_\infty, \mathcal{X}_\infty, P_\infty(dx_\infty|\theta))$ which is a countable product of $(X, \mathcal{X}, P(dx|\theta))$. Consider $x_n = (x^1, \ldots, x^n)$ as a subsequence of $x_\infty = (x^1, x^2, \ldots)$. Let $P^*_n(d\theta|x_n)$ be the localization of $P_n(d\theta|x_n)$ by $\theta \mapsto n^{1/2}(\theta - \hat\theta_n)$. Then by the Bernstein-von Mises theorem,

$$\lim_{n \to \infty} \int \|P^*_n(d\theta|x_n) - N(0, I(\theta)^{-1})\| P_\infty(dx_\infty|\theta) = 0. \tag{5.3}$$

Hence for any subsequence of $\mathbf{N}$, there exists a further subsequence $n_1 < n_2 < \ldots$ such that, $P_\infty(dx_\infty|\theta)$-a.s., $w(P^*_{n_i}(d\theta|x_{n_i}), N(0, I(\theta)^{-1})) \to 0$ as $i \to \infty$. Write $\tau_n = n^{1/2}(\tilde\theta_n - \hat\theta_n)$ for the central value of $P^*_n(d\theta|x_n)$. Since the central value is continuous with respect to weak convergence, $\tau_{n_i}$ tends to 0 $P_\infty(dx_\infty|\theta)$-almost surely, hence $\tau_n$ tends to 0 in $P_\infty(dx_\infty|\theta)$-probability. Therefore the former claim follows. By continuity of $I$,

$$\|N(0, I(\theta)^{-1}) - N(\tau_n, I(\tilde\theta_n)^{-1})\| \to 0$$

in $P_\infty(dx_\infty|\theta)$-probability and by the convergence (5.3), we obtain

$$\int \|P^*_n(d\theta^*|x_n) - N(\tau_n, I(\tilde\theta_n)^{-1})\| P_n(dx_n|\theta) \to 0$$

for any $\theta$. Then integrating the above with respect to $\Lambda$, the latter claim follows by the dominated convergence theorem. □

Under $P_n(dx_n|\theta)$ we can construct the following table. Statistics in the same column of the table are equivalent under $P_n(dx_n|\theta)$, that is, if $A_n$ and $B_n$ are in the same column, $A_n - B_n$ tends to 0 in $P_n(dx_n|\theta)$-probability.

Under $P_n(dx_n|\theta)$, we prefer to use the left-hand side statistics. Under $P_n(dx_n|\theta)\Lambda(d\theta) = P_n(d\theta|x_n)P_n(dx_n)$, we will use the right-hand side. We will use both representations depending on the situation.
Likelihood statistics



Posterior statistics 



9, e n = 9 + n 1 ^I(9)- 1 Z n (x n \e) 
1(9) 




Table 1: Equivalent statistics 



5.2 Quadratic mean differentiability of marginal model 

When we use a Gibbs sampler, we usually choose the probability transition kernel (or model) $P(dxdy|\theta)$ from a simple parametric family such as an exponential family. If $P(dxdy|\theta)$ is an exponential family, quadratic mean differentiability is quite easy to show. On the other hand, quadratic mean differentiability of $P(dx|\theta)$ is sometimes not easy to show even if $P(dxdy|\theta)$ belongs to an exponential family. In this subsection, under a certain condition, we show that quadratic mean differentiability of $P(dx|\theta)$ follows from that of $P(dxdy|\theta)$.

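Let $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$ be measurable spaces and $\Theta \subset \mathbf{R}^p$ be an open set with Borel $\sigma$-algebra $\Xi$. Let $\nu(dxdy)$ be a $\sigma$-finite measure on $(X \times Y, \mathcal{X} \otimes \mathcal{Y})$. Assume that there exists a transition kernel $\nu(dy|x)$ from $X$ to $Y$ such that $\nu(dxdy) = \nu(dx)\nu(dy|x)$ where $\nu(dx) = \int_{y \in Y} \nu(dxdy)$.

Now we drop the assumption of quadratic mean differentiability of $P(dx|\theta)$ and derive it from that of $P(dxdy|\theta)$. Assume the existence of an $\mathcal{X} \otimes \mathcal{Y} \otimes \Xi$-measurable function $p(xy|\theta)$ such that

$$P(dxdy|\theta) = p(xy|\theta)\nu(dxdy).$$

Then (5.1) holds for $p(x|\theta) = \int_{y \in Y} p(xy|\theta)\nu(dy|x)$. Assume $P(dxdy|\theta)$ is quadratic mean differentiable at $\theta$ with quadratic mean derivative $\eta(xy|\theta)$. Set

$$\eta(x|\theta) = \int_{y \in Y} \eta(xy|\theta) \frac{\sqrt{p(xy|\theta)}}{\sqrt{p(x|\theta)}} \nu(dy|x) \tag{5.4}$$

if $p(x|\theta) > 0$, and $\eta(x|\theta) = 0$ otherwise.

Proposition 5.4. Assume $P(dxdy|\theta)$ is quadratic mean differentiable at $\theta$ and for any $A \in \mathcal{X} \otimes \mathcal{Y}$, for any $\theta_1, \theta_2 \in \Theta$,

$$\int_A P(dxdy|\theta_1) = 0 \Rightarrow \int_A P(dxdy|\theta_2) = 0.$$

Then $P(dx|\theta)$ is quadratic mean differentiable at $\theta$ having quadratic mean derivative $\eta(x|\theta)$ defined in (5.4).

Proof. For $h \in \mathbf{R}^p$, let

$$R_h(xy) = \sqrt{p(xy|\theta + h)} - \sqrt{p(xy|\theta)} - h^T \eta(xy|\theta),$$

$$r_h(x) = \sqrt{p(x|\theta + h)} - \sqrt{p(x|\theta)} - h^T \eta(x|\theta).$$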
By assumption, $\int |R_h(xy)|^2 \nu(dxdy) = o(|h|^2)$. We show $\int |r_h(x)|^2 \nu(dx) = o(|h|^2)$. For any $\epsilon > 0$, divide $X$ into three subsets $A_0 = \{x; p(x|\theta) = 0\}$, $A_1 = \{x; p(x|\theta) \in (0, \delta)\}$ and $A_2 = \{x; p(x|\theta) \in [\delta, \infty)\}$ where $\delta = \delta(\epsilon)$ will be defined later.

First step. We show $\int_{A_0} |r_h(x)|^2 \nu(dx) = 0$. By definition, $p(x|\theta) = 0$ and $\eta(x|\theta) = 0$ for $x \in A_0$. Moreover

$$0 = \int_{A_0} P(dx|\theta) = \int_{A_0 \times Y} P(dxdy|\theta) \Rightarrow \int_{A_0 \times Y} P(dxdy|\theta + h) = 0,$$

hence $\int_{A_0} P(dx|\theta + h) = \int_{A_0} p(x|\theta + h)\nu(dx) = 0$. Therefore $r_h(x) = 0$ for $\nu$-a.e. $x$ in $A_0$, which proves the first claim.

Second step. We show $\limsup_{|h| \to 0} |h|^{-2} \int_{A_1} |r_h(x)|^2 \nu(dx) < \epsilon$ for a suitable choice of $\delta > 0$. Set $a^-(x) = \sqrt{p(x|\theta + h)} - \sqrt{p(x|\theta)}$, $A^-(xy) = \sqrt{p(xy|\theta + h)} - \sqrt{p(xy|\theta)}$, $b(x) = h^T \eta(x|\theta)$ and $B(xy) = h^T \eta(xy|\theta)$. Since $r_h = a^- - b$, by Schwarz's inequality and Minkowski's inequality, $|r_h(x)|$ is bounded above by

$$|a^-(x)| + |b(x)| \le \left(\int |A^-(xy)|^2 \nu(dy|x)\right)^{1/2} + \left(\int |B(xy)|^2 \nu(dy|x)\right)^{1/2}.$$

Moreover, since $A^- = R_h + B$ we have

$$|r_h(x)| \le \left(\int |R_h(xy)|^2 \nu(dy|x)\right)^{1/2} + 2\left(\int |B(xy)|^2 \nu(dy|x)\right)^{1/2}.$$

Since $|h|^{-1}|B(xy)| \le |\eta(xy|\theta)|$, for $A_1 = \{p(x|\theta) \in (0, \delta)\}$,

$$\limsup_{|h| \to 0} |h|^{-2} \int_{A_1} |r_h(x)|^2 \nu(dx) \le 4 \int_{A_1 \times Y} |\eta(xy|\theta)|^2 \nu(dxdy).$$

By the dominated convergence theorem, we can take $\delta$ small enough that the right-hand side is smaller than $\epsilon$. Hence the second claim follows.

Third step. We show $\lim_{|h| \to 0} |h|^{-2} \int_{A_2} |r_h(x)|^2 \nu(dx) = 0$. Let $A^+(xy) = \sqrt{p(xy|\theta + h)} + \sqrt{p(xy|\theta)}$ and $a^+(x) = \sqrt{p(x|\theta + h)} + \sqrt{p(x|\theta)}$. Since $A^- = R_h + B$,

$$a^-(x) = \frac{a^-(x)a^+(x)}{a^+(x)} = \int_Y \frac{A^-(xy)A^+(xy)}{a^+(x)} \nu(dy|x) = \int_Y \frac{R_h(xy)A^+(xy)}{a^+(x)} \nu(dy|x) + \int_Y \frac{B(xy)A^+(xy)}{a^+(x)} \nu(dy|x)$$

and denote $s_0(x)$ for the first term of the right-hand side. Since $A^+(xy) = A^-(xy) + 2\sqrt{p(xy|\theta)}$, the second term becomes

$$\int_Y \frac{B(xy)A^-(xy)}{a^+(x)} \nu(dy|x) + 2 \int_Y \frac{B(xy)\sqrt{p(xy|\theta)}}{a^+(x)} \nu(dy|x)$$

and denote $s_1(x)$ for the first term of the right-hand side. The second term can be simplified by using the relation $\int_Y B(xy)\sqrt{p(xy|\theta)}/\sqrt{p(x|\theta)}\,\nu(dy|x) = b(x)$. Using this relation, the second term minus $b(x)$ becomes

$$2\frac{\sqrt{p(x|\theta)}}{a^+(x)} b(x) - b(x) = -b(x)\frac{a^-(x)}{a^+(x)} =: s_2(x).$$

Hence $r_h = a^- - b = s_0 + s_1 + s_2$. The order of the integrals is given in the following table.

$$\begin{array}{l|ccc} & O(1) & O(|h|^2) & o(|h|^2) \\ \hline \nu(dxdy)\text{-integral} & |A^+|^2 & |A^-|^2,\ |B|^2 & |R_h|^2 \\ \nu(dx)\text{-integral} & & |a^-|^2,\ |b|^2 & \end{array}$$

The table means that, for example, since $|A^-|^2$ is categorized in $\nu(dxdy)$-integral and $O(|h|^2)$,

$$\int_{X \times Y} |A^-(xy)|^2 \nu(dxdy) = O(|h|^2).$$

Since $a^+(x) > \delta$ for $x \in A_2$, we do not have to worry about degeneracy of the denominator. By Schwarz's inequality, $(\int_{A_2} |s_0(x)|^2 \nu(dx))^{1/2}$ is bounded above by

$$\delta^{-1}\left(\int |R_h(xy)|^2 \nu(dxdy)\right)^{1/2}\left(\int |A^+(xy)|^2 \nu(dxdy)\right)^{1/2} = o(|h|).$$

Similarly, $(\int_{A_2} |s_1(x)|^2 \nu(dx))^{1/2} = O(|h|^2) = o(|h|)$ and $(\int_{A_2} |s_2(x)|^2 \nu(dx))^{1/2} = O(|h|^2) = o(|h|)$. Since $(\int_{A_2} |r_h(x)|^2 \nu(dx))^{1/2} \le \sum_{i=0}^{2} (\int_{A_2} |s_i(x)|^2 \nu(dx))^{1/2} = o(|h|)$, the third claim follows.

Therefore

$$\limsup_{h \to 0} |h|^{-2} \int |r_h(x)|^2 \nu(dx) = \limsup_{h \to 0} |h|^{-2} \sum_{i=0}^{2} \int_{A_i} |r_h(x)|^2 \nu(dx) < \epsilon.$$

This proves the proposition. □



5.3 Convergence of normalized partial score 

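Here we assume the condition in Proposition 5.4 and define $\eta(x|\theta)$ as in (5.4). Hence both $P(dxdy|\theta)$ and $P(dx|\theta)$ are quadratic mean differentiable, having score functions $\tilde\eta(xy|\theta)$ and $\tilde\eta(x|\theta)$ and Fisher information matrices $K(\theta)$ and $I(\theta)$, respectively. Let $\tilde\eta(y|x, \theta) = \tilde\eta(xy|\theta) - \tilde\eta(x|\theta)$ and $J(\theta) = K(\theta) - I(\theta)$. Note that $\int \tilde\eta(y|x, \theta)P(dy|x, \theta) = 0$ and

$$\int \tilde\eta(y|x, \theta)\tilde\eta(y|x, \theta)^T P(dxdy|\theta) = J(\theta).$$

Set

$$Z_n(x_n, y_n|\theta) = n^{-1/2} \sum_{i=1}^n \tilde\eta(x^i y^i|\theta), \qquad Z_n(x_n|\theta) = n^{-1/2} \sum_{i=1}^n \tilde\eta(x^i|\theta)$$

and $Z_n(y_n|x_n, \theta) = Z_n(x_n, y_n|\theta) - Z_n(x_n|\theta)$. We define a probability transition kernel $Q_n$ from $X_n \times \Theta$ to $\mathbf{R}^p$ by

$$Q_n(x_n, \theta, A) = \int_{y_n \in Y_n} 1_A(Z_n(y_n|x_n, \theta)) P_n(dy_n|x_n, \theta). \tag{5.5}$$

Proposition 5.5. Assume the condition in Proposition 5.4. Suppose that $J(\theta)$ is non-singular and $K(\theta)$ is continuous in $\theta$. Then

$$\lim_{n \to \infty} \int w(Q_n(x_n, \theta, \cdot), N(0, J(\theta))) P_n(dx_n|\theta) = 0.$$

Proof. Let $(X_\infty, \mathcal{X}_\infty, P_\infty(dx_\infty|\theta))$ be a countable product of $(X, \mathcal{X}, P(dx|\theta))$. Consider $x_n = (x^1, \ldots, x^n)$ as a subsequence of $x_\infty = (x^1, x^2, \ldots)$. By the law of large numbers, $P_\infty(dx_\infty|\theta)$-a.s.,

$$\lim_{n \to \infty} n^{-1} \sum_{i=1}^n \int \tilde\eta(y^i|x^i, \theta)\tilde\eta(y^i|x^i, \theta)^T P(dy^i|x^i, \theta) = J(\theta) \tag{5.6}$$

and for $\phi_c(x) = |x|^2 1_{\{|x| > c\}}$, for any $c > 0$, $P_\infty(dx_\infty|\theta)$-a.s.,

$$n^{-1} \sum_{i=1}^n \int_{y^i \in Y} \phi_c(\tilde\eta(y^i|x^i, \theta)) P(dy^i|x^i, \theta) \to \int_{X \times Y} \phi_c(\tilde\eta(y|x, \theta)) P(dxdy|\theta). \tag{5.7}$$

Let $A_\infty$ be a subset of $X_\infty$ such that the convergences (5.6) and (5.7) hold for $c \in \mathbf{Q}_+ = \{s/t; s, t \in \mathbf{N}\}$, that is, $A_\infty$ is the set satisfying the Lindeberg condition. Then $\int_{A_\infty} P_\infty(dx_\infty|\theta) = 1$ and since the Lindeberg condition holds,

$$w(Q_n(x_n, \theta, \cdot), N(0, J(\theta))) \to 0$$

for any $x_\infty \in A_\infty$. Hence the claim follows. □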



6 Local consistency for standard Gibbs sampler 

We study local consistency of the Gibbs sampler for independent and identically distributed observations. We remark that, in general, if the regularity conditions are not satisfied, a sequence of standard Gibbs samplers is not always locally consistent. For example, the sequence of usual Gibbs samplers for the probit regression model is not locally consistent, which is proved in [7]. This inconsistency partly explains the poor behavior of the Markov chain Monte Carlo procedure.
6.1 Sequence of standard Gibbs sampler

In the later subsections, we consider local consistency of the Gibbs sampler in the i.i.d. setting. In this subsection we prepare some notation related to the Gibbs sampler. First we define the standard Gibbs sampler for a general space. After that, we define the standard Gibbs sampler for a more specific situation, namely the i.i.d. setting.

Let $(\Theta, d)$ be a complete separable metric space equipped with the Borel $\sigma$-algebra $\Xi$ and let $(X, \mathcal{X}, P)$ and $(Y, \mathcal{Y})$ be a probability space and a measurable space, respectively. Set $(S, \mathcal{S}) := (Y, \mathcal{Y}) \otimes (\Theta, \Xi)$. Write an element of $S$ as $s = (y, \theta)$. Assume that there are probability transition kernels $P(dy|x, \theta)$, $P(d\theta|x, y)$, $P(d\theta|x)$ and $P(dy|x)$ such that for $P$-a.s. $x$,

$$P(dy|x, \theta)P(d\theta|x) = P(d\theta|x, y)P(dy|x). \tag{6.1}$$

When the above relation holds, we define a probability transition kernel $\bar K$ from $X \times S$ to $S$ and another probability transition kernel $K$ from $X \times \Theta$ to $\Theta$ such that

$$\bar K(x, s, ds^*) = P(dy^*|x, \theta)P(d\theta^*|x, y^*) \tag{6.2}$$

for $s = (y, \theta)$ and $s^* = (y^*, \theta^*)$ and

$$K(x, \theta, d\theta^*) = \int_{y \in Y} P(dy|x, \theta)P(d\theta^*|x, y). \tag{6.3}$$

Note that $\bar K(x, s, ds^*)$ does not depend on $y$.

We show that $P(d\theta|x)$ is the invariant probability transition kernel of $K$. See the end of Section 4.1 for the definition of the invariant probability transition kernel.

Proposition 6.1. Under (6.1), the invariant probability transition kernel of $K$ is $\Pi(x, d\theta) = P(d\theta|x)$.

Proof. Without loss of generality, we can assume (6.1) for any $x \in X$. By definition, for any $A \in \Xi$,

$$\int_{\theta \in \Theta,\, \theta^* \in A} \Pi(x, d\theta)K(x, \theta, d\theta^*) = \int_{\theta \in \Theta,\, y \in Y,\, \theta^* \in A} P(d\theta|x)P(dy|x, \theta)P(d\theta^*|x, y).$$

Using (6.1), we can integrate out $\theta$ and then, using (6.1) again, we can also integrate out $y$. This calculation yields

$$(\Pi \circ K)(x, A) := \int_{\theta \in \Theta,\, \theta^* \in A} \Pi(x, d\theta)K(x, \theta, d\theta^*) = \int_{\theta^* \in A} P(d\theta^*|x) = \Pi(x, A).$$

Since $\Xi$ is countably generated, there exists a countable subset $\Xi_0$ of $\Xi$ which generates $\Xi$. Then

$$\tilde X := \{x \in X; (\Pi \circ K)(x, A) = \Pi(x, A)\ (\forall A \in \Xi_0)\}$$

has measure 1 under $P(dx)$. Now fix an element $x$ of $\tilde X$. Then the subset of $\Xi$ defined by

$$Z = Z_x := \{A \in \Xi; (\Pi \circ K)(x, A) = \Pi(x, A)\}$$

is a $\sigma$-algebra which contains $\Xi_0$. Hence $Z = \Xi$ for any $x \in \tilde X$. This means that if $x \in \tilde X$, then $\Pi \circ K(x, \cdot) = \Pi(x, \cdot)$, and $P(\tilde X) = 1$, that is, $\Pi$ is an invariant probability transition kernel of $K$. □
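Proposition 6.1 can be verified exactly when $Y$ and $\Theta$ are finite. The sketch below is an illustration under that assumption: for a fixed $x$ it takes a joint table $P(y, \theta|x)$ with positive marginals, builds the kernel (6.3) from the two conditionals, and applies it to $\Pi(x, \cdot) = P(d\theta|x)$. The function names and the numerical table are hypothetical.

```python
from fractions import Fraction


def gibbs_kernel(joint):
    """Build Pi = P(theta|x) and the kernel K of (6.3) from joint[y][t] = P(y, t | x).

    K[t][u] = sum_y P(y | x, t) * P(u | x, y), with all marginals assumed positive.
    """
    n_y, n_t = len(joint), len(joint[0])
    p_theta = [sum(joint[y][t] for y in range(n_y)) for t in range(n_t)]
    p_y = [sum(joint[y]) for y in range(n_y)]
    kernel = [
        [sum((joint[y][t] / p_theta[t]) * (joint[y][u] / p_y[y]) for y in range(n_y))
         for u in range(n_t)]
        for t in range(n_t)
    ]
    return p_theta, kernel


def apply_kernel(pi, kernel):
    """Return the measure (Pi o K)(u) = sum_t Pi(t) K[t][u]."""
    n = len(pi)
    return [sum(pi[t] * kernel[t][u] for t in range(n)) for u in range(n)]


if __name__ == "__main__":
    joint = [[Fraction(1, 10), Fraction(2, 10)],
             [Fraction(3, 10), Fraction(4, 10)]]
    pi, kernel = gibbs_kernel(joint)
    print(pi, apply_kernel(pi, kernel))
```

With exact rational arithmetic the two printed measures coincide, which mirrors the two integrations performed in the proof.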

Now we define the standard Gibbs sampler. Let $e = (e_m; m = 1, 2, \ldots)$ be a sequence of probability transition kernels $e_m$ defined by

$$e_m(s_m, \cdot) = \frac{1}{m} \sum_{i=0}^{m-1} \delta_{\theta(i)}$$

where $s_m = (s(0), \ldots, s(m-1)) \in S^m$ and $s(i) = (y(i), \theta(i))$. We call $e$ a sequence of empirical distributions on $\Theta$.

Definition 6.2. Assume that there are probability transition kernels

$$P(dy|x, \theta),\ P(d\theta|x, y),\ P(d\theta|x),\ P(dy|x)$$

such that (6.1) holds. Let $M$ be a random Markov measure generated by $(\bar\mu, \bar K)$ for $\bar\mu(x, ds) = P(d\theta|x)P(dy|x, \theta)$ and $\bar K$ defined by (6.2). Then $\mathcal{M} = (M, e)$ is called a standard Gibbs sampler on $(X, \mathcal{X}, P)$ on $(S, \Theta)$ when $e$ is a sequence of empirical distributions on $\Theta$.
Now we concentrate on the Gibbs sampler in a more specific setting, the i.i.d. setting.
Let $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$ be measurable spaces and let $(\Theta, d)$ be a complete and separable
metric space with Borel $\sigma$-algebra $\Xi$. A probability measure $\Lambda$ is defined on
$(\Theta, \Xi)$. Let $P(dxdy \mid \theta)$ be a probability transition kernel from $\Theta$ to $X \times Y$ such
that there exist probability transition kernels $P(dy \mid x, \theta)$ and $P(dx \mid \theta)$ satisfying
$P(dxdy \mid \theta) = P(dy \mid x, \theta) P(dx \mid \theta)$. A sequence of standard Gibbs samplers will be
constructed from this probability measure and this probability transition kernel.

Let $(X^n, \mathcal{X}^n)$ and $(Y^n, \mathcal{Y}^n)$ be the $n$-fold products of $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$. Write their
elements as $x_n = (x^1, \ldots, x^n)$ and $y_n = (y^1, \ldots, y^n)$, respectively. We
define probability transition kernels

$$P_n(dx_n \mid \theta) = \prod_{i=1}^{n} P(dx^i \mid \theta), \qquad P_n(dx_n dy_n \mid \theta) = \prod_{i=1}^{n} P(dx^i dy^i \mid \theta),$$

$$P_n(dy_n \mid x_n, \theta) = \prod_{i=1}^{n} P(dy^i \mid x^i, \theta),$$

and probability measures

$$P_n(dx_n) = \int_{\Theta} P_n(dx_n \mid \theta) \Lambda(d\theta), \qquad P_n(dx_n dy_n) = \int_{\Theta} P_n(dx_n dy_n \mid \theta) \Lambda(d\theta).$$

Assume that there are probability transition kernels $P_n(d\theta \mid x_n, y_n)$ and $P_n(d\theta \mid x_n)$
satisfying

$$P_n(d\theta \mid x_n, y_n) P_n(dx_n dy_n) = P_n(dx_n dy_n \mid \theta) \Lambda(d\theta) \tag{6.4}$$

and

$$P_n(d\theta \mid x_n) P_n(dx_n) = P_n(dx_n \mid \theta) \Lambda(d\theta). \tag{6.5}$$

Moreover, we assume (6.1) for these transition kernels, that is,

$$P_n(dy_n \mid x_n, \theta) P_n(d\theta \mid x_n) = P_n(d\theta \mid x_n, y_n) P_n(dy_n \mid x_n).$$

Note that this relation is automatically satisfied if $\mathcal{X}$ is countably generated.
Let $(S_n, \mathcal{S}_n) = (Y^n, \mathcal{Y}^n) \otimes (\Theta, \Xi)$.

Definition 6.3. Assume (6.1), (6.4) and (6.5). Let $\mathcal{M}_n = (M_n, e_n)$ be a standard Gibbs
sampler on $(X^n, \mathcal{X}^n, P_n(dx_n))$ on $(S_n, \Theta)$ defined by

$$P_n(dy_n \mid x_n, \theta),\quad P_n(d\theta \mid x_n, y_n),\quad P_n(d\theta \mid x_n),\quad P_n(dy_n \mid x_n).$$

Then $(\mathcal{M}_n; n = 1, 2, \ldots)$ is called a sequence of standard Gibbs samplers generated
by $P(dxdy \mid \theta)$ and $\Lambda$.
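To make the definition concrete, here is a minimal sketch of a standard Gibbs sampler for a toy model that is not taken from the paper: prior theta ~ N(0, 1), hidden y^i | theta ~ N(theta, 1) and observed x^i | y^i ~ N(y^i, 1), independently over i. Under these assumptions the three conditional laws are Gaussian in closed form: y^i | x^i, theta ~ N((x^i + theta)/2, 1/2), theta | x_n, y_n ~ N(sum(y)/(n+1), 1/(n+1)) and theta | x_n ~ N(sum(x)/(n+2), 2/(n+2)). The function name `standard_gibbs` is hypothetical.

```python
import random
import statistics


def standard_gibbs(x, m, rng):
    """Run m steps of the toy Gibbs sampler; return the states (y(i), theta(i)).

    Toy model (an assumption for illustration only):
      theta ~ N(0, 1), y^i | theta ~ N(theta, 1), x^i | y^i ~ N(y^i, 1).
    """
    n = len(x)
    # Initial law: theta(0) is drawn from the posterior given x_n alone.
    theta = rng.gauss(sum(x) / (n + 2), (2.0 / (n + 2)) ** 0.5)
    states = []
    for _ in range(m):
        # y^i | x^i, theta ~ N((x^i + theta) / 2, 1/2), independently over i.
        y = [rng.gauss((xi + theta) / 2.0, 0.5 ** 0.5) for xi in x]
        states.append((y, theta))
        # theta | x_n, y_n ~ N(sum(y) / (n + 1), 1 / (n + 1)).
        theta = rng.gauss(sum(y) / (n + 1), (1.0 / (n + 1)) ** 0.5)
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(1.0, 2.0 ** 0.5) for _ in range(20)]
    thetas = [theta for (_y, theta) in standard_gibbs(x, 5000, rng)]
    print("chain mean:", statistics.fmean(thetas))
    print("posterior mean:", sum(x) / (len(x) + 2))
```

In this toy case the exact posterior is known, so the mean and variance of the theta-coordinates can be compared directly with sum(x)/(n+2) and 2/(n+2).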

Later, we will analyze $\mathcal{M}_n$. For that purpose, it is convenient to
consider an alternative, equivalent Monte Carlo procedure whose state is $\theta$ alone. Let $e$ be a
sequence of empirical distributions and let the probability transition kernel $K_n$ from $X^n \times \Theta$
to $\Theta$ be

$$K_n(x_n, \theta, d\theta^*) = \int_{y_n \in Y^n} P_n(dy_n \mid x_n, \theta) P_n(d\theta^* \mid x_n, y_n). \tag{6.6}$$

Let $M_n$ now denote the random Markov measure generated by $(\Pi_n, K_n)$, where $\Pi_n(x_n, d\theta) =
P_n(d\theta \mid x_n)$. Then the procedure $(M_n, e)$ is equivalent to the standard Gibbs sampler of
Definition 6.3. We refer to this Markov chain Monte Carlo procedure as the minimal equivalent
Markov chain Monte Carlo procedure.
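The kernel in (6.6) is simply the composition of the two conditional draws with the auxiliary variable discarded. A minimal sketch, assuming the two conditional samplers are supplied as callables; the names `minimal_kernel`, `draw_y` and `draw_theta` are hypothetical:

```python
def minimal_kernel(draw_y, draw_theta):
    """Compose two conditional samplers into a kernel acting on theta only.

    draw_y(x, theta, rng)  -> a draw of y given (x, theta)
    draw_theta(x, y, rng)  -> a draw of theta* given (x, y)
    The returned step integrates y out by sampling it and throwing it away.
    """
    def step(x, theta, rng):
        y = draw_y(x, theta, rng)       # auxiliary draw, not kept in the state
        return draw_theta(x, y, rng)    # next parameter value theta*
    return step


if __name__ == "__main__":
    # Deterministic stand-ins for the samplers, only to show the data flow.
    step = minimal_kernel(lambda x, theta, rng: theta + x,
                          lambda x, y, rng: 2 * y)
    print(step(1, 3, None))
```

The resulting chain lives on the parameter space alone, which is what makes it convenient for the asymptotic analysis.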

6.2 Approximation of the standard Gibbs sampler 

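In this subsection, we fix $\theta_0 \in \Theta \subset \mathbf{R}^p$ and all arguments are under $P_n(dx_n \mid \theta_0)$
and $P_n(dx_n dy_n \mid \theta_0)$. We assume the same condition as in Section 5.3. Table 2, which is an
extension of Table 1, lists the equivalent statistics that we use.

| Likelihood statistics | Posterior statistics |
|---|---|
| $\theta_0,\ \bar\theta_n(x_n),\ \bar\theta_n(x_n, y_n)$ | $\hat\theta_n,\ \hat\theta_n(x_n),\ \hat\theta_n(x_n, y_n)$ |
| $I(\theta_0)^{-1} Z_n(x_n \mid \theta_0)$ | $n^{1/2}(\hat\theta_n(x_n) - \theta_0)$ |
| $K(\theta_0)^{-1} Z_n(x_n, y_n \mid \theta_0)$ | $n^{1/2}(\hat\theta_n(x_n, y_n) - \theta_0)$ |
| $I(\theta_0)$ | $I(\hat\theta_n)$ |
| $K(\theta_0)$ | $K(\hat\theta_n)$ |
| $J(\theta_0)$ | $J(\hat\theta_n)$ |

Table 2: Equivalent statistics

Write the central values of $P_n(d\theta \mid x_n, y_n)$ and $P_n(d\theta \mid x_n)$ as $\hat\theta_n(x_n, y_n)$ and
$\hat\theta_n(x_n)$, respectively. In the following, we write $A =_a B$ if $A - B$ tends
in $P_n(dx_n dy_n \mid \theta_0)$-probability to 0. Then by Table 2 and by $I_{\theta_0}^{-1} = K_{\theta_0}^{-1}(J_{\theta_0} +
I_{\theta_0}) I_{\theta_0}^{-1} = K_{\theta_0}^{-1} J_{\theta_0} I_{\theta_0}^{-1} + K_{\theta_0}^{-1}$,

$$\begin{aligned}
n^{1/2}(\hat\theta_n(x_n, y_n) - \hat\theta_n(x_n)) &=_a K_{\theta_0}^{-1} Z_n(x_n, y_n \mid \theta_0) - I_{\theta_0}^{-1} Z_n(x_n \mid \theta_0) \\
&=_a K_{\theta_0}^{-1} Z_n(y_n \mid x_n, \theta_0) - K_{\theta_0}^{-1} J_{\theta_0} I_{\theta_0}^{-1} Z_n(x_n \mid \theta_0) \\
&=_a K_{\theta_0}^{-1} Z_n(y_n \mid x_n, \theta_0) + n^{1/2} K_{\theta_0}^{-1} J_{\theta_0} (\theta_0 - \hat\theta_n(x_n))
\end{aligned}$$
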

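where $L_{\theta_0} = L(\theta_0)$ for $L = I$, $J$ and $K$. By Proposition 5.5, the law of
$Z_n(y_n \mid x_n, \theta_0)$ tends to $N(0, J(\theta_0))$. Hence, formally, we replace $Z_n(y_n \mid x_n, \theta_0)$
by $J_{\theta_0}^{1/2} \xi_1$, where $\xi_1$ follows $N_p(0, I)$ (here $I$ is the $p \times p$ identity matrix), that is,

$$n^{1/2}(\hat\theta_n(x_n, y_n) - \hat\theta_n(x_n)) \sim K_{\theta_0}^{-1} J_{\theta_0}^{1/2} \xi_1 + n^{1/2} K_{\theta_0}^{-1} J_{\theta_0} (\theta_0 - \hat\theta_n(x_n))$$

where $\sim$ means "similar" in a certain sense (this is just a formal argument). Since
$P_n(d\theta \mid x_n, y_n)$ tends to $N(\hat\theta_n(x_n, y_n), n^{-1} K(\theta_0)^{-1})$, a realization $\theta^*$ from $P_n(d\theta \mid x_n, y_n)$
satisfies

$$n^{1/2}(\theta^* - \hat\theta_n(x_n, y_n)) \sim K_{\theta_0}^{-1/2} \xi_2$$

where $\xi_2$ follows $N_p(0, I)$. Hence

$$n^{1/2}(\theta^* - \hat\theta_n(x_n)) \sim K_{\theta_0}^{-1/2} \xi_2 + K_{\theta_0}^{-1} J_{\theta_0}^{1/2} \xi_1 + n^{1/2} K_{\theta_0}^{-1} J_{\theta_0} (\theta_0 - \hat\theta_n(x_n))$$

where $\xi_1$ and $\xi_2$ follow $N_p(0, I)$ independently. Therefore we approximate
$K_n(x_n, \theta_0, d\theta^*)$ defined by (6.6) by

$$N\bigl(\hat\theta_n(x_n) + K_{\theta_0}^{-1} J_{\theta_0} (\theta_0 - \hat\theta_n(x_n)),\ n^{-1} K_{\theta_0}^{-1} + n^{-1} K_{\theta_0}^{-1} J_{\theta_0} K_{\theta_0}^{-1}\bigr).$$

Replacing $I, J, K$ at $\theta_0$ by $\hat I, \hat J, \hat K := I, J, K$ at $\hat\theta_n(x_n)$, we set

$$\hat K_n(x_n, \theta_0, \cdot) := N\bigl(\hat\theta_n(x_n) + \hat K^{-1} \hat J (\theta_0 - \hat\theta_n(x_n)),\ n^{-1} \hat K^{-1} + n^{-1} \hat K^{-1} \hat J \hat K^{-1}\bigr).$$

Since $\Pi_n(x_n, d\theta) = P_n(d\theta \mid x_n)$ is approximated by $N(\hat\theta_n(x_n), n^{-1} \hat I^{-1})$, we set

$$\hat\Pi_n(x_n, \cdot) := N(\hat\theta_n(x_n), n^{-1} \hat I^{-1}).$$

We approximate $M_n$ by a random Markov measure $\hat M_n$ generated by $(\hat\Pi_n, \hat K_n)$.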
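As a sanity check on this approximation, the Gaussian initial law is invariant for the Gaussian transition kernel. The short computation below is a sketch under two assumptions: the matrices I, J, K evaluated at the central value (written with hats) are symmetric, and they satisfy K = I + J, which is the relation behind the identity for the inverse of I used earlier in this subsection.

```latex
% Assumptions (for illustration): \hat I, \hat J, \hat K symmetric and \hat K = \hat I + \hat J.
% If \theta \sim N(\hat\theta_n, n^{-1}\hat I^{-1}) and, given \theta,
% \theta^* \sim N(\hat\theta_n + \hat K^{-1}\hat J(\theta - \hat\theta_n),
%                 n^{-1}\hat K^{-1} + n^{-1}\hat K^{-1}\hat J\hat K^{-1}),
% then \theta^* is Gaussian with mean \hat\theta_n and covariance
\begin{align*}
n^{-1}\hat K^{-1}\hat J\,\hat I^{-1}\hat J\hat K^{-1}
  + n^{-1}\hat K^{-1} + n^{-1}\hat K^{-1}\hat J\hat K^{-1}
 &= n^{-1}\hat K^{-1}\bigl(\hat J\hat I^{-1}\hat J + \hat K + \hat J\bigr)\hat K^{-1} \\
 &= n^{-1}\hat K^{-1}\bigl(\hat I + 2\hat J + \hat J\hat I^{-1}\hat J\bigr)\hat K^{-1} \\
 &= n^{-1}\hat K^{-1}(\hat I + \hat J)\,\hat I^{-1}(\hat I + \hat J)\hat K^{-1} \\
 &= n^{-1}\hat I^{-1}.
\end{align*}
```

So the approximating chain is a stationary Gaussian autoregression whose marginal law is the normal approximation of the posterior.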

6.3 Local consistency of the standard Gibbs sampler 

We study local consistency of the standard Gibbs sampler. Before stating the
result, we make one remark on the initial probability transition kernel. For a fixed
observation, the standard Gibbs sampler uses the posterior distribution as its
initial distribution, which is unrealistic. However, as mentioned in Section 4.3,
we can replace the initial distribution by a small perturbation of an $n^{1/2}$-consistent
estimator, as in Example 4.13.

Let $\hat\theta_n(x_n)$ be a central value of $P_n(d\theta \mid x_n)$. We consider the localization
$\theta \mapsto n^{1/2}(\theta - \hat\theta_n(x_n))$.

Theorem 6.4. Assume the condition in Proposition 5.4 and Assumption 5.2.
Suppose that $I(\theta)$ and $J(\theta)$ are non-singular and $I(\theta)$, $K(\theta)$ are continuous in
$\theta$. Then the standard Gibbs sampler $(\mathcal{M}_n; n = 1, 2, \ldots)$ is locally consistent to
$(\Pi_n; n = 1, 2, \ldots)$.

Proof. It is sufficient to study the sequence of minimal equivalent Markov chain
Monte Carlo procedures $\mathcal{M}_n = (M_n, e)$ defined after Definition 6.3. First we
show

$$\lim_{n \to \infty} \int \|\Pi_n \otimes K_n(x_n, \cdot) - \hat\Pi_n \otimes \hat K_n(x_n, \cdot)\| P_n(dx_n) = 0 \tag{6.7}$$

where $\hat\Pi_n$ and $\hat K_n$ are defined in the previous subsection. By the triangle
inequality, we have

$$\|\Pi_n \otimes K_n - \hat\Pi_n \otimes \hat K_n\| \le \|\Pi_n \otimes K_n - \Pi_n \otimes \hat K_n\| + \|\Pi_n \otimes \hat K_n - \hat\Pi_n \otimes \hat K_n\|$$

and the second term of the right-hand side is bounded by $\|\Pi_n - \hat\Pi_n\|$, which tends
in $P_n(dx_n)$-probability to 0 by Proposition 5.3. Since $\Pi_n(x_n, d\theta) = P_n(d\theta \mid x_n)$,
the first term integrated by $P_n(dx_n)$ is bounded by

$$\int \|(K_n - \hat K_n)(x_n, \theta, \cdot)\| P_n(d\theta \mid x_n) P_n(dx_n) = \int \|(K_n - \hat K_n)(x_n, \theta, \cdot)\| P_n(dx_n \mid \theta) \Lambda(d\theta).$$

We fix $\theta \in \Theta$ and consider the convergence of the integrand with respect to $\Lambda$. We use the
likelihood statistics in the sense of Table 2. To show this convergence, we introduce
two probability transition kernels $L_n$ and $\hat L_n$ and consider the inequality

$$\|K_n - \hat K_n\| \le \|K_n - L_n\| + \|L_n - \hat L_n\| + \|\hat L_n - \hat K_n\|. \tag{6.8}$$

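First, we construct $\hat L_n$ by

$$\hat L_n(x_n, \theta, \cdot) = N\bigl(\bar\theta_n(x_n) - n^{-1/2} K_\theta^{-1} J_\theta I_\theta^{-1} Z_n(x_n \mid \theta),\ n^{-1} K_\theta^{-1} + n^{-1} K_\theta^{-1} J_\theta K_\theta^{-1}\bigr)$$

where $\bar\theta_n(x_n, y_n) = \theta + n^{-1/2} K_\theta^{-1} Z_n(x_n, y_n \mid \theta)$ and, analogously,
$\bar\theta_n(x_n) = \theta + n^{-1/2} I_\theta^{-1} Z_n(x_n \mid \theta)$. Then the integral of the last term
of the right-hand side of (6.8) tends to 0, since the differences between $\hat L_n$ and
$\hat K_n$ are made by asymptotically equivalent statistics (see Table 2). Second, we
construct $L_n$ by

$$L_n(x_n, \theta, d\theta^*) = \int P_n(dy_n \mid x_n, \theta)\, \phi(\theta^*; \bar\theta_n(x_n, y_n), n^{-1} K_\theta^{-1})\, d\theta^*.$$

Then the first term of the right-hand side of (6.8) integrated by $P_n(dx_n \mid \theta)$ is
bounded by

$$\int \|P_n(d\theta \mid x_n, y_n) - N(\bar\theta_n(x_n, y_n), n^{-1} K_\theta^{-1})\| P_n(dx_n dy_n \mid \theta) \to 0,$$

which is a consequence of the Bernstein-von Mises theorem for $P(dxdy \mid \theta)$. Third,
we consider the middle term of the right-hand side of (6.8). Now we make the
localization

$$\theta^* \mapsto n^{1/2}\bigl(\theta^* - \bar\theta_n(x_n) + n^{-1/2} K_\theta^{-1} J_\theta I_\theta^{-1} Z_n(x_n \mid \theta)\bigr) = n^{1/2}\bigl(\theta^* - \theta - n^{-1/2} K_\theta^{-1} Z_n(x_n \mid \theta)\bigr).$$

Then the localizations of $\hat L_n$ and $L_n$ are $\hat L_n^*(x_n, \theta, \cdot) = N(0, K_\theta^{-1} + K_\theta^{-1} J_\theta K_\theta^{-1})$ and

$$L_n^*(x_n, \theta, du) = \int P_n(dy_n \mid x_n, \theta)\, \phi(u; K_\theta^{-1} Z_n(y_n \mid x_n, \theta), K_\theta^{-1})\, du.$$

Note that

$$\hat L_n^*(x_n, \theta, du) = \int \phi(v; 0, J_\theta)\, dv\, \phi(u; K_\theta^{-1} v, K_\theta^{-1})\, du.$$

Take $B_r = \{x \in \mathbf{R}^p; |x| < r\}$ and set

$$\psi_u(v) = \phi(u; K_\theta^{-1} v, K_\theta^{-1}),$$

which is a Lipschitz continuous function with Lipschitz constant $c(\theta) > 0$. Then
for $M > 0$, $\|(L_n^* - \hat L_n^*)(x_n, \theta, \cdot)\|$ is bounded above by

$$\hat L_n^*(x_n, \theta, B_M^c) + \int_{u \in B_M} \Bigl| \int_{v \in \mathbf{R}^p} \psi_u(v) Q_n(x_n, \theta, dv) - \int_{v \in \mathbf{R}^p} \psi_u(v) \phi(v; 0, J_\theta)\, dv \Bigr|\, du$$

where $Q_n$ is defined by (5.5). The latter term is bounded above by

$$\mathrm{Leb}(B_M)\, c(\theta)\, w(Q_n(x_n, \theta, dv), N(0, J_\theta)). \tag{6.9}$$

Since $\hat L_n^*(x_n, \theta, du)$ does not depend on $x_n$, we can take $M$ large enough that
$\hat L_n^*(x_n, \theta, B_M^c) < \epsilon$ (say), and the integral of (6.9) tends to 0 by Proposition 5.5.
Hence $\int \|L_n - \hat L_n\| P_n(dx_n \mid \theta) \to 0$ for any $\theta$, which completes the proof of (6.7).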

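Now we make the localization $\theta \mapsto n^{1/2}(\theta - \hat\theta_n(x_n))$. Unfortunately, since $\hat\Pi_n^*$
and $\hat K_n^*$ are random, we cannot directly use Propositions 4.8 and 4.9 to conclude
local consistency of $\mathcal{M}_n$. However, since $P_n(dx_n) = \int P_n(dx_n \mid \theta) \Lambda(d\theta)$,

$$\lim_{n \to \infty} \int \|\Pi_n^* \otimes K_n^*(x_n, \cdot) - \hat\Pi_n^* \otimes \hat K_n^*(x_n, \cdot)\| P_n(dx_n \mid \theta) = 0 \tag{6.10}$$

in $\Lambda$-probability, and for each $\theta$, we can replace $\hat\Pi_n^*(x_n, \cdot)$ and $\hat K_n^*(x_n, \cdot)$ by
the non-random kernels $\hat\Pi$ and $\hat K$, where

$$\hat\Pi(\cdot) = N(0, I^{-1}(\theta)), \qquad \hat K(u, \cdot) = N\bigl(K^{-1}(\theta) J(\theta) u,\ K^{-1}(\theta) + K^{-1}(\theta) J(\theta) K^{-1}(\theta)\bigr).$$

Therefore,

$$\lim_{n \to \infty} \int \|\Pi_n^* \otimes K_n^*(x_n, \cdot) - \hat\Pi \otimes \hat K\| P_n(dx_n \mid \theta) = 0 \tag{6.11}$$

in $\Lambda$-probability. Fix $m_n \to \infty$. For any subsequence of $\mathbf{N}$, there is a further
subsequence $n_1 < n_2 < \ldots$ such that the above convergence holds for $\Lambda$-a.e.
$\theta$ with $n$ replaced by $n_i$. Then we can apply Propositions 4.8 and 4.9 to conclude
consistency of $\mathcal{M}_{n_i}^* = (M_{n_i}^*, e_{m_{n_i}})$ under $P_{n_i}(dx_{n_i} \mid \theta)$, that is,

$$\int\!\!\int w\bigl(e_{m_{n_i}}(\theta_\infty), \Pi_{n_i}^*(x_{n_i}, \cdot)\bigr) M_{n_i}^*(x_{n_i}, d\theta_\infty) P_{n_i}(dx_{n_i} \mid \theta) \to 0$$

for $\Lambda$-a.e. $\theta$, where $M_n^*$ is the localization of $M_n$. Since the convergence holds along
some further subsequence of any subsequence of $\mathbf{N}$, we have

$$\int\!\!\int w\bigl(e_{m_n}(\theta_\infty), \Pi_n^*(x_n, \cdot)\bigr) M_n^*(x_n, d\theta_\infty) P_n(dx_n) \to 0,$$

which means consistency of $\mathcal{M}_n^*$, and this is the desired conclusion. □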



7 Discussion 

In this paper, we defined Monte Carlo procedures and Markov chain Monte
Carlo procedures as a set of probability measures on $S^{\mathbf{N}_0}$ together with a sequence of maps
from finite products of $S$ to $\Theta$. In particular, we studied local consistency for a
sequence of standard Gibbs samplers under regularity conditions. This property
describes good behavior of a sequence of Markov chain Monte Carlo procedures.
The following topics were not discussed in this paper.

1. Analysis of poor behavior of a sequence of Markov chain Monte
Carlo procedures. If the sequence has a good property, we do
not have to tune the algorithm, since we already have a good Monte
Carlo procedure. Poor behavior can be studied through degeneracy and local
degeneracy of Markov chain Monte Carlo procedures. Moreover, we can
define a rate of convergence. This will be studied in separate
work, for example for mixture models and categorical data models.

2. Construction of new Monte Carlo procedures. Unfortunately, the
analysis in this paper is for usual Monte Carlo procedures. We believe that
this analysis is useful for constructing new Monte Carlo procedures. The
paper [6] shows one possibility.

3. We do not consider point estimation but posterior approximation. This
is just for simplicity. Let $(\mathcal{M}_n = (M_n, e); n = 1, 2, \ldots)$ be consistent to
$(\Pi_n; n = 1, 2, \ldots)$, where $e$ is a sequence of empirical distributions. Then
it is easy to show that if $M_n$ is stationary, then $m^{-1} \sum_{i=0}^{m-1} \theta(i)$ tends to
$\int \theta\, \Pi_n(x_n, d\theta)$ when the measures $|\theta|\, \Pi_n(x_n, d\theta)$ ($n = 1, 2, \ldots$) are tight.
When $M_n$ is not stationary and (4.3) holds, the same conclusion holds if
we use a burn-in. Note that without a burn-in, it may not be true.
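The effect of a burn-in on the ergodic average can be seen on a toy chain. The sketch below is an illustration only and is not the setting of the paper: it uses a scalar Gaussian autoregression with stationary mean 0, started far from stationarity, and the names `ar1_chain` and `ergodic_average` are hypothetical.

```python
import random


def ar1_chain(start, rho, noise_sd, m, rng):
    """Return m states of theta(i+1) = rho * theta(i) + N(0, noise_sd^2) noise."""
    chain = []
    theta = start
    for _ in range(m):
        chain.append(theta)
        theta = rho * theta + rng.gauss(0.0, noise_sd)
    return chain


def ergodic_average(chain, burn_in=0):
    """Average of the chain after discarding the first burn_in states."""
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("burn-in removes every state")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    rng = random.Random(1)
    chain = ar1_chain(start=1000.0, rho=0.9, noise_sd=1.0, m=2000, rng=rng)
    print("without burn-in:", ergodic_average(chain))
    print("with burn-in 200:", ergodic_average(chain, burn_in=200))
```

With this starting point the early states dominate the plain average, while the average after the burn-in stays close to the stationary mean 0.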